Abstract

In this paper, we give a new generalization error bound of Multiple Kernel Learning (MKL)
for a general class of regularizations, and discuss what kind of regularization gives a
favorable predictive accuracy. Our main target in this paper is dense type regularizations
including $\ell_p$-MKL. According to recent numerical experiments, sparse regularization
does not necessarily show good performance compared with dense type regularizations.
Motivated by this fact, this paper gives a general theoretical tool to derive fast learning
rates of MKL that is applicable to arbitrary mixed-norm-type regularizations in a unifying
manner. This enables us to compare the generalization performances of various types of
regularizations. As a consequence, we observe that the homogeneity of the complexities of
the candidate reproducing kernel Hilbert spaces (RKHSs) affects which regularization strategy
($\ell_1$ or dense) is preferred. In fact, in homogeneous complexity settings where the
complexities of all RKHSs are the same, $\ell_1$-regularization is optimal among all isotropic
norms. On the other hand, in inhomogeneous complexity settings, dense type regularizations can
show a better learning rate than sparse $\ell_1$-regularization. We also show that our learning
rate achieves the minimax lower bound in homogeneous complexity settings.
Keywords: Multiple Kernel Learning, Fast Learning Rate, Mini-max Lower Bound, 
Non-sparse, Generalization Error Bounds 



1. Introduction 



Multiple Kernel Learning (MKL) proposed by Lanckriet et al. (2004) is one of the most
promising methods that adaptively select the kernel function in supervised kernel learning.
The kernel method is widely used and several studies have supported its usefulness
(Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004). However, the performance
of kernel methods critically relies on the choice of the kernel function. Many methods
have been proposed to deal with the issue of kernel selection. Ong et al. (2005) studied
hyperkernels as a kernel of kernel functions. Argyriou et al. (2006) considered a DC
programming approach to learn a mixture of kernels with continuous parameters. Some studies
tackled the problem of learning a non-linear combination of kernels, as in Bach (2009);
Cortes et al. (2009a); Varma and Babu (2009). Among them, learning a linear combination of
finite candidate kernels with non-negative coefficients is the most basic, fundamental and
commonly used approach. The seminal work of MKL by Lanckriet et al. (2004) considered learning
a convex combination of candidate kernels as well as its linear combination. This work opened
up the sequence of MKL studies. Bach et al. (2004) showed that MKL can be reformulated
as a kernel version of the group lasso (Yuan and Lin, 2006). This formulation gives
the insight that MKL can be described as an $\ell_1$-mixed-norm regularized method. As a
generalization of MKL, $\ell_p$-MKL, which imposes $\ell_p$-mixed-norm regularization, has been
proposed (Micchelli and Pontil, 2005; Kloft et al., 2009). $\ell_p$-MKL includes the original
MKL as a special case as $\ell_1$-MKL. Another direction of generalization is elasticnet-MKL
(Shawe-Taylor, 2008; Tomioka and Suzuki, 2009), which imposes a mixture of $\ell_1$-mixed-norm
and $\ell_2$-mixed-norm regularizations. Recently, numerical studies have shown that
$\ell_p$-MKL with $p > 1$ and elasticnet-MKL show better performance than $\ell_1$-MKL in
several situations (Kloft et al., 2009; Cortes et al., 2009b; Tomioka and Suzuki, 2009).
An interesting observation here is that both $\ell_p$-MKL and elasticnet-MKL produce denser
estimators than the original $\ell_1$-MKL while they show favorable performance. The goal of
this paper is to give a theoretical justification for these experimental results favorable
for the dense type MKL methods. To this aim, we give a unifying framework to derive a fast
learning rate of an arbitrary norm type regularization, and discuss which regularization is
preferred depending on the problem settings.

In the pioneering paper of Lanckriet et al. (2004), a convergence rate of MKL is given
as $\sqrt{M/n}$, where $M$ is the number of given kernels and $n$ is the number of samples.
Srebro and Ben-David (2006) gave an improved learning bound utilizing the pseudo-dimension
of the given kernel class. Ying and Campbell (2009) gave a convergence bound utilizing
Rademacher chaos and gave some upper bounds of the Rademacher chaos utilizing the
pseudo-dimension of the kernel class. Cortes et al. (2009b) presented a convergence bound
for a learning method with $L_2$ regularization on the kernel weight. Cortes et al. (2010)
gave the convergence rate of $\ell_p$-MKL as $\sqrt{\log(M)/n}$ for $p = 1$ and
$M^{1-\frac{1}{p}}/\sqrt{n}$ for $1 < p \le 2$. Kloft et al. (2011) gave a similar convergence
bound with improved constants. Kloft et al. (2010) generalized this bound to a variant of the
elasticnet type regularization and widened the effective range of $p$ to all $p \ge 1$, while
$1 \le p \le 2$ had been imposed in the existing works. One concern about these bounds is that
all bounds introduced above are "global" bounds in the sense that they are applicable to all
candidates of estimators. Consequently all convergence rates presented above are of order
$1/\sqrt{n}$ with respect to the number $n$ of samples. However, by utilizing localization
techniques, including the so-called local Rademacher complexity (Bartlett et al., 2005;
Koltchinskii, 2006) and the peeling device (van de Geer, 2000), we can derive a faster
learning rate. Instead of uniformly bounding all candidates of estimators, the localized
inequality focuses on a particular estimator such as the empirical risk minimizer, and thus
can give a sharp convergence rate.

Localized bounds of MKL have been given mainly in sparse learning settings
(Koltchinskii and Yuan, 2008; Meier et al., 2009; Koltchinskii and Yuan, 2010), and there
are only a few studies for non-sparse settings in which the sparsity of the ground truth is
not assumed. The first localized bound of MKL was derived by Koltchinskii and Yuan (2008) in
the setting of $\ell_1$-MKL. The second one was given by Meier et al. (2009), who gave a near
optimal convergence rate for elasticnet type regularization. Recently Koltchinskii and Yuan
(2010) considered a variant of $\ell_1$-MKL and showed that it achieves the minimax optimal
convergence rate. All these localized convergence rates were considered in sparse learning
settings, and it has not been discussed how a dense type regularization outperforms the
sparse $\ell_1$-regularization. Recently Kloft and Blanchard (2011) gave a localized
convergence bound of $\ell_p$-MKL. However, their analysis assumed a strong condition where
the RKHSs have no correlation with each other.

In this paper, we show a unifying framework to derive fast convergence rates of MKL
with various regularization types. The framework is applicable to arbitrary mixed-norm
regularizations including $\ell_p$-MKL and elasticnet-MKL. Our learning rate utilizes the
localization technique, and thus is tighter than global type learning rates. Moreover our
analysis does not require the no-correlation assumption of Kloft and Blanchard (2011). We
discuss our bound in two situations: the homogeneous complexity situation and the
inhomogeneous complexity situation, where homogeneous complexity means that all RKHSs have
the same complexity and inhomogeneous complexity means that the complexities of the RKHSs
differ from each other. In the homogeneous situation, we apply our general framework to
some examples and show that our bound achieves the minimax-optimal rate. As a by-product,
we obtain a tighter convergence rate of $\ell_p$-MKL than existing results. Moreover we show
that our bound indicates that $\ell_1$-MKL shows the best performance among all "isotropic"
mixed-norm regularizations in homogeneous settings. Next we analyze our bound in
inhomogeneous settings where the complexities of the RKHSs are not all the same. We
show that dense type regularizations can give better generalization error bounds than the
sparse $\ell_1$-regularization in the inhomogeneous setting. Here it should be noted that in
real settings inhomogeneous complexity is more natural than homogeneous complexity. Finally
we give numerical experiments to show the validity of the theoretical investigations. We see
that the numerical experiments support the theoretical findings well. As far as the author
knows, this is the first theoretical attempt to clearly show that inhomogeneous complexities
are advantageous for dense type MKL.



2. Preliminary 

In this section we give the problem formulation, the notations and the assumptions required 
for the convergence analysis. 

2.1 Problem Formulation 

Suppose that we are given $n$ i.i.d. samples $\{(x_i, y_i)\}_{i=1}^n$ distributed from a
probability distribution $P$ on $\mathcal{X} \times \mathbb{R}$ where $\mathcal{X}$ is an input
space. We denote by $\Pi$ the marginal distribution of $P$ on $\mathcal{X}$. We are given $M$
reproducing kernel Hilbert spaces (RKHS) $\{\mathcal{H}_m\}_{m=1}^M$, each of which is
associated with a kernel $k_m$. We consider a mixed-norm type regularization with respect to
an arbitrary given norm $\|\cdot\|_\psi$, that is, the regularization is given by the norm
$\|(\|f_m\|_{\mathcal{H}_m})_{m=1}^M\|_\psi$ of the vector $(\|f_m\|_{\mathcal{H}_m})_{m=1}^M$
for $f_m \in \mathcal{H}_m$ $(m = 1, \dots, M)$ (see footnote 1). For notational simplicity,
we write $\|f\|_\psi = \|(\|f_m\|_{\mathcal{H}_m})_{m=1}^M\|_\psi$ for
$f = \sum_{m=1}^M f_m$ $(f_m \in \mathcal{H}_m)$.



1. We assume that the mixed-norm $\|(\|f_m\|_{\mathcal{H}_m})_{m=1}^M\|_\psi$ satisfies the
triangular inequality with respect to $(f_m)_{m=1}^M$, that is,
$\|(\|f_m + f'_m\|_{\mathcal{H}_m})_{m=1}^M\|_\psi \le \|(\|f_m\|_{\mathcal{H}_m})_{m=1}^M\|_\psi + \|(\|f'_m\|_{\mathcal{H}_m})_{m=1}^M\|_\psi$.
To satisfy this condition, it is sufficient if the norm is monotone, i.e.,
$\|a\|_\psi \le \|a + b\|_\psi$ for all $a, b \ge 0$.
The general formulation of MKL that we consider in this paper fits a function
$f = \sum_{m=1}^M f_m$ $(f_m \in \mathcal{H}_m)$ to the data by solving the following
optimization problem:

$$\hat{f} = \sum_{m=1}^M \hat{f}_m = \mathop{\arg\min}_{f_m \in \mathcal{H}_m\ (m=1,\dots,M)} \ \frac{1}{n}\sum_{i=1}^n \Big(y_i - \sum_{m=1}^M f_m(x_i)\Big)^2 + \lambda_1^{(n)} \|f\|_\psi^2. \tag{1}$$

We call this "$\psi$-norm MKL". This formulation covers many practically used MKL methods
(e.g., $\ell_p$-MKL, elasticnet-MKL, variable sparsity kernel learning (see later for their
definitions)), and is solvable by a finite dimensional optimization procedure due to the
representer theorem (Kimeldorf and Wahba, 1971). In this paper, we mainly focus on the
regression problem (the squared loss). However, the discussion can be generalized to Lipschitz
continuous and strongly convex losses as in Bartlett et al. (2005) (see the later discussion).



Example 1: $\ell_p$-MKL The first motivating example of $\psi$-norm MKL is $\ell_p$-MKL
(Kloft et al., 2009) that employs the $\ell_p$-norm for $1 \le p \le \infty$ as the regularizer:

$$\|f\|_\psi = \|(\|f_m\|_{\mathcal{H}_m})_{m=1}^M\|_{\ell_p} = \Big(\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^p\Big)^{\frac{1}{p}}.$$

If $p$ is strictly greater than 1 ($p > 1$), the solution of $\ell_p$-MKL becomes dense. In
particular, $p = 2$ corresponds to averaging candidate kernels with uniform weight
(Micchelli and Pontil, 2005). It is reported that $\ell_p$-MKL with $p$ greater than 1, say
$p = 4/3$, often shows better performance than the original sparse $\ell_1$-MKL
(Cortes et al., 2010).

Example 2: Elasticnet-MKL The second example is elasticnet-MKL (Shawe-Taylor, 2008;
Tomioka and Suzuki, 2009) that employs a mixture of $\ell_1$ and $\ell_2$ norms as the
regularizer:

$$\|f\|_\psi = \tau\|f\|_{\ell_1} + (1-\tau)\|f\|_{\ell_2} = \tau\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m} + (1-\tau)\Big(\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^2\Big)^{\frac{1}{2}}$$

with $\tau \in [0, 1]$. Elasticnet-MKL shares the same spirit with $\ell_p$-MKL in the sense
that it bridges sparse $\ell_1$-regularization and dense $\ell_2$-regularization. An efficient
optimization method for elasticnet-MKL is proposed by Suzuki and Tomioka (2011).

Example 3: Variable Sparsity Kernel Learning Variable Sparsity Kernel Learning (VSKL)
proposed by Aflalo et al. (2011) divides the RKHSs into $M'$ groups
$\{\mathcal{H}_{j,k}\}_{k=1}^{M_j}$ $(j = 1, \dots, M')$ and imposes a mixed norm regularization

$$\|f\|_\psi = \|f\|_{(p,q)} = \Big\{\sum_{j=1}^{M'}\Big(\sum_{k=1}^{M_j} \|f_{j,k}\|_{\mathcal{H}_{j,k}}^p\Big)^{\frac{q}{p}}\Big\}^{\frac{1}{q}},$$

where $1 \le p$, $1 \le q$, and $f_{j,k} \in \mathcal{H}_{j,k}$. An advantageous point
of VSKL is that by adjusting the parameters $p$ and $q$, various levels of sparsity can be
introduced. The parameters can control the level of sparsity within groups and between groups.
This point is beneficial especially for multi-modal tasks like object categorization.
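
The three regularizers in Examples 1 to 3 act only on the vector of per-kernel RKHS norms, so they can be evaluated with a few lines of code. The following is a minimal sketch, assuming the block norms $\|f_m\|_{\mathcal{H}_m}$ have already been computed; the function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def lp_mixed_norm(block_norms, p):
    """l_p norm of the vector of per-kernel RKHS norms (p >= 1, p may be np.inf)."""
    a = np.abs(np.asarray(block_norms, dtype=float))
    if np.isinf(p):
        return float(a.max())
    return float((a ** p).sum() ** (1.0 / p))

def elasticnet_mixed_norm(block_norms, tau):
    """tau * l_1 + (1 - tau) * l_2 of the per-kernel norms, with tau in [0, 1]."""
    return tau * lp_mixed_norm(block_norms, 1) + (1.0 - tau) * lp_mixed_norm(block_norms, 2)

def vskl_mixed_norm(grouped_norms, p, q):
    """(p, q) mixed norm: l_p within each group, then l_q across the groups."""
    within = [lp_mixed_norm(group, p) for group in grouped_norms]
    return lp_mixed_norm(within, q)
```

With a single group, `vskl_mixed_norm([norms], p, q)` reduces to `lp_mixed_norm(norms, p)` for any `q`, which mirrors the remark that VSKL contains $\ell_p$-MKL as the case $M' = 1$.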



2.2 Notations and Assumptions 

Here, we prepare notations and assumptions that are used in the analysis. Let
$\mathcal{H}^{\oplus M} = \mathcal{H}_1 \oplus \cdots \oplus \mathcal{H}_M$. We utilize the same
notation $f \in \mathcal{H}^{\oplus M}$ indicating both the vector $(f_1, \dots, f_M)$
and the function $f = \sum_{m=1}^M f_m$ $(f_m \in \mathcal{H}_m)$. This is a slight abuse of
notation because the decomposition $f = \sum_{m=1}^M f_m$ might not be unique as an element of
$L_2(\Pi)$. However this will not cause any confusion.

Throughout the paper, we assume the following technical conditions (see also Bach (2008)).
Assumption 1 (Realizable Assumption)

(A1) There exists $f^* = (f_1^*, \dots, f_M^*) \in \mathcal{H}^{\oplus M}$ such that
$E[Y|X] = f^*(X) = \sum_{m=1}^M f_m^*(X)$, and the noise $\epsilon := Y - f^*(X)$ is bounded
as $|\epsilon| \le L$.

Assumption 2 (Kernel Assumption)

(A2) For each $m = 1, \dots, M$, $\mathcal{H}_m$ is separable (with respect to the RKHS norm)
and $\sup_{X \in \mathcal{X}} |k_m(X, X)| \le 1$.

The first assumption in (A1) ensures that the model $\mathcal{H}^{\oplus M}$ is correctly
specified, and the technical assumption $|\epsilon| \le L$ allows $\epsilon f$ to be Lipschitz
continuous with respect to $f$. The noise boundedness can be relaxed to the unbounded situation
as in Raskutti et al. (2010) if we consider Gaussian noise, but we do not pursue that direction
for simplicity.

Let an integral operator $T_{k_m} : L_2(\Pi) \to L_2(\Pi)$ corresponding to a kernel function
$k_m$ be

$$T_{k_m} f = \int k_m(\cdot, x) f(x)\, d\Pi(x).$$

It is known that this operator is compact, positive, and self-adjoint (see Theorem 4.27 of
Steinwart (2008)). Thus it has at most countably many non-negative eigenvalues. We denote
by $\mu_{\ell,m}$ the $\ell$-th largest eigenvalue (with possible multiplicity) of the integral
operator $T_{k_m}$. By Theorem 4.27 of Steinwart (2008), the sum of $\mu_{\ell,m}$ is bounded
($\sum_\ell \mu_{\ell,m} < \infty$), and thus $\mu_{\ell,m}$ decreases with order $\ell^{-1}$
($\mu_{\ell,m} = o(\ell^{-1})$). We further assume the sequence of the eigenvalues converges
even faster to zero.

Assumption 3 (Spectral Assumption) There exist $0 < s_m < 1$ and $0 < c$ such that

(A3) $\mu_{\ell,m} \le c\,\ell^{-\frac{1}{s_m}}$, $(\forall \ell \ge 1,\ 1 \le \forall m \le M)$,

where $\{\mu_{\ell,m}\}_{\ell=1}^\infty$ is the spectrum of the operator $T_{k_m}$ corresponding
to the kernel $k_m$.

It was shown that the spectral assumption (A3) is equivalent to the classical covering number
assumption (Steinwart et al., 2009). Recall that the $\epsilon$-covering number
$N(\epsilon, B_{\mathcal{H}_m}, L_2(\Pi))$ with respect to $L_2(\Pi)$ is the minimal number of
balls with radius $\epsilon$ needed to cover the unit ball $B_{\mathcal{H}_m}$ in
$\mathcal{H}_m$ (van der Vaart and Wellner, 1996). If the spectral assumption (A3) and
the boundedness assumption (A2) hold, there exists a constant $C$ that depends only on $s$
and $c$ such that

$$\log N(\epsilon, B_{\mathcal{H}_m}, L_2(\Pi)) \le C \epsilon^{-2 s_m}, \tag{2}$$

and the converse is also true (see Steinwart et al. (2009, Theorem 15) and Steinwart (2008)
for details). Therefore, if $s_m$ is large, the RKHSs are regarded as "complex", and if $s_m$ is
small, the RKHSs are "simple".
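
As an illustration of how the decay exponent in (A3) can be read off in practice, the following sketch estimates $s_m$ from the empirical spectrum of a Gram matrix. It is not part of the paper's analysis: it assumes the eigenvalues of the Gram matrix divided by $n$ approximate those of the integral operator, and it uses the kernel $k(x, x') = \min(x, x')$ on $[0, 1]$ with a uniform grid as a test case, for which the operator eigenvalues are known to be $1/((\ell - 1/2)^2\pi^2)$, i.e. $s = 1/2$. The function name and the fitting window are hypothetical choices.

```python
import numpy as np

def estimate_decay_exponent(gram, lo=5, hi=50):
    """Fit mu_l ~ l**(-1/s) on eigenvalues lo..hi of gram / n and return the estimate of s."""
    n = gram.shape[0]
    mu = np.sort(np.linalg.eigvalsh(gram / n))[::-1]   # empirical spectrum, descending
    idx = np.arange(lo, hi + 1)
    slope, _ = np.polyfit(np.log(idx), np.log(mu[idx - 1]), 1)
    return -1.0 / slope                                 # slope is approximately -1/s

n = 500
x = np.arange(1, n + 1) / n            # deterministic grid standing in for uniform samples
gram = np.minimum.outer(x, x)          # k(x, x') = min(x, x'), so k(x, x) <= 1 as in (A2)
s_hat = estimate_decay_exponent(gram)  # expected to be close to 1/2 for this kernel
```

The estimate depends on the fitting window because the polynomial decay only holds asymptotically; a smoother kernel would give a smaller estimate, consistent with the reading of small $s_m$ as a "simple" RKHS.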

An important class of RKHSs where $s_m$ is known is the Sobolev space. (A3) holds with
$s_m = \frac{d}{2\alpha}$ for the Sobolev space $W^{\alpha,2}(\mathcal{X})$ of $\alpha$-times
continuous differentiability on the Euclidean ball $\mathcal{X}$ of $\mathbb{R}^d$
(Edmunds and Triebel, 1996). Moreover, for $\alpha$-times continuously differentiable kernels
on a closed Euclidean ball in $\mathbb{R}^d$, (A3) holds for $s_m = \frac{d}{2\alpha}$
(Steinwart, 2008, Theorem 6.26). According to Theorem 7.34 of Steinwart (2008), for Gaussian
kernels with a compactly supported distribution, (A3) holds for arbitrarily small $0 < s_m$.
The covering number of Gaussian kernels with an unbounded support distribution is also
described in Theorem 7.34 of Steinwart (2008).

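Let $\kappa_M$ be defined as follows:

$$\kappa_M := \sup\Big\{\kappa \ge 0 \ \Big|\ \kappa \le \frac{\big\|\sum_{m=1}^M f_m\big\|_{L_2(\Pi)}^2}{\sum_{m=1}^M \|f_m\|_{L_2(\Pi)}^2},\ \forall f_m \in \mathcal{H}_m\ (m = 1, \dots, M)\Big\}. \tag{3}$$

$\kappa_M$ represents the correlation of the RKHSs. We assume that the RKHSs are not completely
correlated with each other.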

Assumption 4 (Incoherence Assumption) $\kappa_M$ is strictly bounded from below; there
exists a constant $C_0 > 0$ such that

(A4) $0 < C_0 < \kappa_M$.

This condition is motivated by the incoherence condition (Koltchinskii and Yuan, 2008;
Meier et al., 2009) considered in sparse MKL settings. This ensures the uniqueness of the
decomposition $f^* = \sum_{m=1}^M f_m^*$ of the ground truth. Bach (2008) also assumed this
condition to show the consistency of $\ell_1$-MKL.

Finally we give a technical assumption with respect to the $\infty$-norm.

Assumption 5 (Embedded Assumption) Under the Spectral Assumption, there exists
a constant $C_1 > 0$ such that

(A5) $\|f_m\|_\infty \le C_1 \|f_m\|_{\mathcal{H}_m}^{s_m} \|f_m\|_{L_2(\Pi)}^{1 - s_m}$.

This condition is met when the input distribution $\Pi$ has a density with respect to the
uniform distribution on $\mathcal{X}$ that is bounded away from 0 and the RKHSs are continuously
embedded in a Sobolev space $W^{\alpha,2}(\mathcal{X})$ where $s_m = \frac{d}{2\alpha}$, $d$ is
the dimension of the input space $\mathcal{X}$ and $\alpha$ is the "smoothness" of the Sobolev
space. Many practically used kernels satisfy this condition (A5). For example, the RKHSs of
Gaussian kernels can be embedded in all Sobolev spaces. Therefore the condition (A5) seems
rather common and practical. More generally, there is a clear characterization of the
condition (A5) in terms of real interpolation of spaces. One can find detailed and formal
discussions of interpolations in Steinwart et al. (2009), and Proposition 2.10 of
Bennett and Sharpley (1988) gives the necessary and sufficient condition for the
assumption (A5).

Constants we use later are summarized in Table 1.



3. Convergence Rate of $\psi$-norm MKL

Here we derive the learning rate of $\psi$-norm MKL in the most general setting. We suppose
that the number of kernels $M$ can increase along with the number of samples $n$. The
motivation of our analysis is summarized as follows:

• Give a unifying framework to derive a sharp convergence rate of $\psi$-norm MKL.

• (homogeneous complexity) Show the convergence rate of some examples using our
general framework, prove its minimax-optimality, and show the optimality of
$\ell_1$-regularization under the condition that the complexities $s_m$ of all RKHSs are
the same.

• (inhomogeneous complexity) Discuss how the dense type regularization outperforms
sparse type regularization when the complexities $s_m$ of the RKHSs are not all the same.

Table 1: Summary of the constants we use in this article.

  $n$          The number of samples.
  $M$          The number of candidate kernels.
  $L$          The bound of the noise; see (A1).
  $c$          The coefficient for Spectral Assumption; see (A3).
  $s_m$        The decay rate of spectrum; see (A3).
  $\kappa_M$   The smallest eigenvalue of the design matrix; see Eq. (3).
  $C_1$        The coefficient for Embedded Assumption; see (A5).

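We define

$$\eta(t) := \eta_n(t) = \max(1, \sqrt{t}, t/\sqrt{n}),$$

for $t > 0$. For given positive reals $\{r_m\}_{m=1}^M$ and given $n$, we define
$\alpha_1, \alpha_2, \beta_1, \beta_2$ as follows:

$$\alpha_1 := \alpha_1(\{r_m\}) = 3\Big(\sum_{m=1}^M \frac{r_m^{-2s_m}}{n}\Big)^{\frac{1}{2}}, \qquad \alpha_2 := \alpha_2(\{r_m\}) = 3\Big\|\Big(\frac{s_m r_m^{1-s_m}}{\sqrt{n}}\Big)_{m=1}^M\Big\|_{\psi^*},$$

$$\beta_1 := \beta_1(\{r_m\}) = 3\Big(\sum_{m=1}^M \frac{r_m^{-\frac{2s_m(3-s_m)}{1+s_m}}}{n^{\frac{2}{1+s_m}}}\Big)^{\frac{1}{2}}, \qquad \beta_2 := \beta_2(\{r_m\}) = 3\Big\|\Big(\frac{s_m r_m^{\frac{(1-s_m)^2}{1+s_m}}}{n^{\frac{1}{1+s_m}}}\Big)_{m=1}^M\Big\|_{\psi^*}, \tag{4}$$

(note that $\alpha_1, \alpha_2, \beta_1, \beta_2$ implicitly depend on the reals
$\{r_m\}_{m=1}^M$). Here $\|\cdot\|_{\psi^*}$ denotes the dual norm of the $\psi$-norm. Then the
following theorem gives the general form of the learning rate of $\psi$-norm MKL.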



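Theorem 1 Suppose Assumptions 1-5 are satisfied. Let $\{r_m\}_{m=1}^M$ be arbitrary positive
reals that can depend on $n$, and assume
$\lambda_1^{(n)} \ge \big(\frac{\alpha_2}{\alpha_1}\big)^2 + \big(\frac{\beta_2}{\beta_1}\big)^2$.
Then there exists a constant $\phi$ depending only on $\{s_m\}_{m=1}^M$, $c$, $C_1$, $L$ such
that for all $n$ and $t'$ that satisfy $\frac{\log(M)}{\sqrt{n}} \le 1$ and
$\frac{4\phi\sqrt{n}}{\kappa_M}\max\{\alpha_1^2, \beta_1^2, \frac{M\log(M)}{n}\}\eta(t') \le \frac{1}{12}$
and for all $t \ge 1$, we have

$$\|\hat{f} - f^*\|_{L_2(\Pi)}^2 \le \frac{24\eta(t)^2\phi^2}{\kappa_M}\Big(\alpha_1^2 + \beta_1^2 + \frac{M\log(M)}{n}\Big) + 4\lambda_1^{(n)}\|f^*\|_\psi^2, \tag{5}$$

with probability $1 - \exp(-t) - \exp(-t')$. In particular, for
$\lambda_1^{(n)} = \big(\frac{\alpha_2}{\alpha_1}\big)^2 + \big(\frac{\beta_2}{\beta_1}\big)^2$,
we have

$$\|\hat{f} - f^*\|_{L_2(\Pi)}^2 \le \frac{24\eta(t)^2\phi^2}{\kappa_M}\Big(\alpha_1^2 + \beta_1^2 + \frac{M\log(M)}{n}\Big) + 4\Big[\Big(\frac{\alpha_2}{\alpha_1}\Big)^2 + \Big(\frac{\beta_2}{\beta_1}\Big)^2\Big]\|f^*\|_\psi^2. \tag{6}$$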



The proof will be given in the Appendix. The statement of Theorem 1 itself is complicated.
Thus we will show later concrete learning rates on some examples such as $\ell_p$-MKL. The
convergence rate (6) depends on the positive reals $\{r_m\}_{m=1}^M$, but the choice of
$\{r_m\}_{m=1}^M$ is arbitrary. Thus by minimizing the right hand side of Eq. (6), we obtain a
tight convergence bound as follows:

$$\|\hat{f} - f^*\|_{L_2(\Pi)}^2 = O_p\Big(\min_{\{r_m\}_{m=1}^M :\, r_m > 0}\Big\{\alpha_1^2 + \beta_1^2 + \Big[\Big(\frac{\alpha_2}{\alpha_1}\Big)^2 + \Big(\frac{\beta_2}{\beta_1}\Big)^2\Big]\|f^*\|_\psi^2 + \frac{M\log(M)}{n}\Big\}\Big). \tag{7}$$

There is a trade-off between the first two terms (a) $:= \alpha_1^2 + \beta_1^2$ and the third
term (b) $:= \big[\big(\frac{\alpha_2}{\alpha_1}\big)^2 + \big(\frac{\beta_2}{\beta_1}\big)^2\big]\|f^*\|_\psi^2$,
that is, if we take $\{r_m\}_m$ large, then the term (a) becomes small and the term (b) becomes
large; on the other hand, if we take $\{r_m\}_m$ small, then it results in large (a) and small
(b). Therefore we need to balance the two terms (a) and (b) to obtain the minimum in Eq. (7).
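
The balancing of a decreasing and an increasing term can be illustrated numerically. The sketch below is not the paper's bound: it assumes a simplified one-parameter model with a common radius $r$, in which term (a) behaves like $A r^{-2s}$ and term (b) like $B r^2$, with hypothetical constants `A` and `B` chosen only for illustration. It compares a grid search with the closed-form stationary point $r = (sA/B)^{1/(2+2s)}$.

```python
import numpy as np

def total_bound(r, A, B, s):
    """Toy model: decreasing term A * r**(-2s) plus increasing term B * r**2."""
    return A * r ** (-2.0 * s) + B * r ** 2

def balanced_radius(A, B, s):
    """Stationary point of total_bound in r, obtained by setting the derivative to zero."""
    return (s * A / B) ** (1.0 / (2.0 + 2.0 * s))

A, B, s = 1e-2, 4.0, 0.5            # hypothetical constants, for illustration only
grid = np.logspace(-3, 1, 4001)     # candidate values of the common radius r
r_grid = grid[np.argmin(total_bound(grid, A, B, s))]
r_star = balanced_radius(A, B, s)
```

Taking `r` much larger or much smaller than `r_star` increases `total_bound`, which is the trade-off described above in miniature.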

We discuss the obtained learning rate in two situations, (i) the homogeneous complexity
situation, and (ii) the inhomogeneous complexity situation:

(i) (homogeneous) All $s_m$ are the same: there exists $0 < s < 1$ such that $s_m = s$ $(\forall m)$ (Sec. 4).

(ii) (inhomogeneous) Not all $s_m$ are the same: there exist $m, m'$ such that $s_m \ne s_{m'}$ (Sec. 5).



4. Analysis on Homogeneous Settings 

Here we assume all $s_m$ are the same, say $s_m = s$ for all $m$ (homogeneous setting). In this
section, we give a simple upper bound of the minimum of the bound (7) (Sec. 4.1), derive
concrete convergence rates of some examples using the simple upper bound (Sec. 4.2), and
show that the simple upper bound achieves the minimax learning rate on the $\psi$-norm ball if
the $\psi$-norm is isotropic (Sec. 4.3). Finally we discuss the optimal regularization
(Sec. 4.4). In Sec. 4.2, we also discuss the difference between our bound of $\ell_p$-MKL and
existing bounds.



4.1 Simplification of Convergence Rate 

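If we restrict the situation so that all $r_m$ are the same ($r_m = r$ $(\forall m)$ for some
$r$), then the minimization in Eq. (7) can be easily carried out as in the following lemma.
Let $\mathbf{1}$ be the $M$-dimensional vector each element of which is 1:
$\mathbf{1} := (1, \dots, 1)^\top \in \mathbb{R}^M$, and $\|\cdot\|_{\psi^*}$ be the dual norm
of the $\psi$-norm (see footnote 2).

Lemma 2 Suppose $s_m = s$ $(\forall m)$ with some $0 < s < 1$, and set
$\lambda_1^{(n)} = 18 M^{\frac{1-s}{1+s}} n^{-\frac{1}{1+s}} \|\mathbf{1}\|_{\psi^*}^{\frac{2s}{1+s}} \|f^*\|_\psi^{-\frac{2}{1+s}}$.
Then for all $n$ and $t'$ that satisfy

$$\frac{4\phi\sqrt{n}}{\kappa_M}\Big(M^{1-\frac{2s}{1+s}} n^{-\frac{1}{1+s}} (\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi)^{\frac{2s}{1+s}} \vee \frac{M\log(M)}{n}\Big)\eta(t') \le \frac{1}{12} \quad \text{and} \quad n \ge \Big(\frac{\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi}{M}\Big)^{\frac{4s}{1-s}},$$

and for all $t \ge 1$, we have

$$\|\hat{f} - f^*\|_{L_2(\Pi)}^2 \le C\eta(t)^2\Big(M^{1-\frac{2s}{1+s}} n^{-\frac{1}{1+s}} (\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi)^{\frac{2s}{1+s}} + \frac{M\log(M)}{n}\Big)$$

with probability $1 - \exp(-t) - \exp(-t')$, where $C$ is a constant depending on $\phi$ and
$\kappa_M$. In particular we have

$$\|\hat{f} - f^*\|_{L_2(\Pi)}^2 = O_p\Big(M^{1-\frac{2s}{1+s}} n^{-\frac{1}{1+s}} (\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi)^{\frac{2s}{1+s}} + \frac{M\log(M)}{n}\Big). \tag{8}$$



2. The dual of the norm $\|\cdot\|_\psi$ is defined as $\|b\|_{\psi^*} := \sup_a \{b^\top a \mid \|a\|_\psi \le 1\}$.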
The proof is given in Appendix F.1. Lemma 2 is derived by assuming $r_m = r$ $(\forall m)$,
which might make the bound loose. However, when the norm $\|\cdot\|_\psi$ is isotropic (whose
definition will appear later), that restriction ($r_m = r$ $(\forall m)$) does not make the
bound loose, that is, the upper bound obtained in Lemma 2 is tight and achieves the minimax
optimal rate (the minimax optimal rate is the one that cannot be improved by any estimator).
In the following, we investigate the general result of Lemma 2 through some important examples.

4.2 Convergence Rate of Some Examples 

4.2.1 Convergence Rate of $\ell_p$-MKL

Here we derive the convergence rate of $\ell_p$-MKL ($1 \le p \le \infty$) where
$\|f\|_\psi = \big(\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^p\big)^{\frac{1}{p}}$ (for $p = \infty$,
it is defined as $\max_m \|f_m\|_{\mathcal{H}_m}$). It is well known that the dual norm of the
$\ell_p$-norm is given as the $\ell_q$-norm where $q$ is the real satisfying
$\frac{1}{p} + \frac{1}{q} = 1$. For notational simplicity, let
$R_p := \big(\sum_{m=1}^M \|f_m^*\|_{\mathcal{H}_m}^p\big)^{\frac{1}{p}}$. Then substituting
$\|f^*\|_\psi = R_p$ and
$\|\mathbf{1}\|_{\psi^*} = \|\mathbf{1}\|_{\ell_q} = M^{\frac{1}{q}} = M^{1-\frac{1}{p}}$ into
the bound (8), the learning rate of $\ell_p$-MKL is given as

$$\|\hat{f} - f^*\|_{L_2(\Pi)}^2 = O_p\Big(n^{-\frac{1}{1+s}} M^{1-\frac{2s}{p(1+s)}} R_p^{\frac{2s}{1+s}} + \frac{M\log(M)}{n}\Big). \tag{9}$$

If we further assume $n$ is sufficiently large such that

$$n \ge M^{\frac{2}{p}} R_p^{-2} (\log M)^{\frac{1+s}{s}}, \tag{10}$$

then the leading term is the first term, and thus we have

$$\|\hat{f} - f^*\|_{L_2(\Pi)}^2 = O_p\Big(n^{-\frac{1}{1+s}} M^{1-\frac{2s}{p(1+s)}} R_p^{\frac{2s}{1+s}}\Big). \tag{11}$$

Note that as the complexity $s$ of the RKHSs becomes small the convergence rate becomes fast.
It is known that $n^{-\frac{1}{1+s}}$ is the minimax optimal learning rate for single kernel
learning. The derived rate of $\ell_p$-MKL is obtained by multiplying the optimal rate of
single kernel learning by a coefficient depending on $M$ and $R_p$. To investigate the
dependency of the learning rate on $R_p$, let us consider two extreme settings, i.e., the
sparse setting $(\|f_m^*\|_{\mathcal{H}_m})_{m=1}^M = (1, 0, \dots, 0)$ and the dense setting
$(\|f_m^*\|_{\mathcal{H}_m})_{m=1}^M = (1, \dots, 1)$, as in Kloft et al. (2011).

• $(\|f_m^*\|_{\mathcal{H}_m})_{m=1}^M = (1, 0, \dots, 0)$: $R_p = 1$ for all $p$. Therefore
the convergence rate $n^{-\frac{1}{1+s}} M^{1-\frac{2s}{p(1+s)}}$ is fast for small $p$ and
the minimum is achieved at $p = 1$. This means that $\ell_1$ regularization is preferred for
sparse truth.

• $(\|f_m^*\|_{\mathcal{H}_m})_{m=1}^M = (1, \dots, 1)$: $R_p = M^{\frac{1}{p}}$, thus the
convergence rate is $M n^{-\frac{1}{1+s}}$ for all $p$. Interestingly, for dense ground truth,
there is no dependency of the convergence rate on the parameter $p$ (later we will show that
this is not the case in the inhomogeneous setting (Sec. 5)). That is, the convergence rate is
$M$ times the optimal learning rate of single kernel learning ($n^{-\frac{1}{1+s}}$) for all
$p$. This means that for the dense setting, the complexity of solving the MKL problem is
equivalent to that of solving $M$ single kernel learning problems.
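
The two extreme settings can be checked numerically from the leading term $n^{-\frac{1}{1+s}} M^{1-\frac{2s}{p(1+s)}} R_p^{\frac{2s}{1+s}}$. The following is a minimal sketch with constants ignored; the function name `lp_mkl_rate` and the particular values of `n`, `M` and `s` are assumptions made only for illustration.

```python
import numpy as np

def lp_mkl_rate(n, block_norms, p, s):
    """Leading term n^(-1/(1+s)) * M^(1 - 2s/(p(1+s))) * R_p^(2s/(1+s)), constants ignored."""
    a = np.asarray(block_norms, dtype=float)
    M = a.size
    R_p = (a ** p).sum() ** (1.0 / p)
    return (n ** (-1.0 / (1.0 + s))
            * M ** (1.0 - 2.0 * s / (p * (1.0 + s)))
            * R_p ** (2.0 * s / (1.0 + s)))

n, M, s = 10_000, 50, 0.5                 # illustrative values, not from the paper
ps = (1.0, 1.5, 2.0, 4.0)
sparse_truth = np.zeros(M)
sparse_truth[0] = 1.0                     # (1, 0, ..., 0)
dense_truth = np.ones(M)                  # (1, ..., 1)

sparse_rates = [lp_mkl_rate(n, sparse_truth, p, s) for p in ps]  # increases with p
dense_rates = [lp_mkl_rate(n, dense_truth, p, s) for p in ps]    # independent of p
```

In this sketch the sparse-truth rate grows with $p$, so $p = 1$ is best, while the dense-truth rate equals $M n^{-\frac{1}{1+s}}$ for every $p$, matching the two bullet points above.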
Comparison with Existing Bounds Here we compare the bound for $\ell_p$-MKL we derived above
with the existing bounds. Let $\mathcal{H}_{\ell_p}(R_p)$ be the $\ell_p$-mixed norm ball with
radius $R_p$:
$\mathcal{H}_{\ell_p}(R_p) := \{f = \sum_{m=1}^M f_m \mid (\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^p)^{\frac{1}{p}} \le R_p\}$.
There are two types of convergence rates: global bounds and localized bounds.

(comparison with existing global bound) Cortes et al. (2010); Kloft et al. (2010, 2011)
gave "global" type bounds for $\ell_p$-MKL as

$$R(f) \le \hat{R}(f) + C\,\frac{M^{1-\frac{1}{p}} \vee \sqrt{\log(M)}}{\sqrt{n}}\,R_p \quad (\text{for all } f \in \mathcal{H}_{\ell_p}(R_p)), \tag{12}$$

where $R(f)$ and $\hat{R}(f)$ are the population risk and the empirical risk. The bounds by
Cortes et al. (2010) and Kloft et al. (2011) are restricted to the situation $1 \le p \le 2$.
On the other hand, our analysis and that of Kloft et al. (2010) cover all $p \ge 1$.

Since our bound is specialized to the regularized risk minimizer $\hat{f}$ defined at Eq. (1)
while the existing bound (12) is applicable to all $f \in \mathcal{H}_{\ell_p}(R_p)$, our bound
is sharper than theirs for sufficiently large $n$. To see this, suppose that

$$n \ge \begin{cases} M^2 R_1^{-2} (\log M)^{-\frac{1+s}{1-s}} & (p = 1), \\ M^{\frac{2}{p}} R_p^{-2} & (p > 1), \end{cases} \tag{13}$$

then we have
$n^{-\frac{1}{1+s}} M^{1-\frac{2s}{p(1+s)}} R_p^{\frac{2s}{1+s}} \le n^{-\frac{1}{2}} (M^{1-\frac{1}{p}} \vee \sqrt{\log(M)}) R_p$
and hence our localized bound is sharper than the global one. Interestingly, the range of $n$
presented in Eq. (13) where the localized bound improves on the global bound is the same (up to
a $\log M$ term) as the range presented in Eq. (10)
($n \ge M^{\frac{2}{p}} R_p^{-2} (\log M)^{\frac{1+s}{s}}$) where the first term in our bound
(9) dominates its second term so that the simplified bound (11) holds. That means that, at
the "phase transition point" from global to localized bound, the first informative term in
our bound becomes the leading term.
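
The crossover between the global and the localized rate can be seen with a small computation. The sketch below drops all constants (so it compares orders only), and the function names and the numerical values of `M`, `R_p`, `p` and `s` are hypothetical choices for illustration.

```python
import numpy as np

def localized_rate(n, M, R_p, p, s):
    """Leading term of the localized bound: n^(-1/(1+s)) M^(1 - 2s/(p(1+s))) R_p^(2s/(1+s))."""
    return (n ** (-1.0 / (1.0 + s))
            * M ** (1.0 - 2.0 * s / (p * (1.0 + s)))
            * R_p ** (2.0 * s / (1.0 + s)))

def global_rate(n, M, R_p, p):
    """Order of the global bound: n^(-1/2) * max(M^(1 - 1/p), sqrt(log M)) * R_p."""
    return n ** (-0.5) * max(M ** (1.0 - 1.0 / p), np.sqrt(np.log(M))) * R_p

def crossover_n(M, R_p, p):
    """Sample size M^(2/p) * R_p^(-2) from the p > 1 case of the sample-size condition."""
    return M ** (2.0 / p) * R_p ** (-2.0)

M, R_p, p, s = 100, 1.0, 2.0, 0.5          # illustrative values, not from the paper
n0 = crossover_n(M, R_p, p)                 # the two rates coincide at this sample size
above = (localized_rate(10 * n0, M, R_p, p, s), global_rate(10 * n0, M, R_p, p))
below = (localized_rate(n0 / 10, M, R_p, p, s), global_rate(n0 / 10, M, R_p, p))
```

For sample sizes above `n0` the localized rate is the smaller of the two, and below `n0` the ordering is reversed, which is the "phase transition" described above.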

Finally we note that, since $s$ can be large as long as Spectral Assumption (A3) is satisfied,
the bound (12) is recovered by our analysis by letting $s$ approach 1.



(comparison with existing localized bound) Recently Kloft and Blanchard (2011) gave
a tighter convergence rate utilizing the localization technique as

$$\|\hat{f} - f^*\|_{L_2(\Pi)}^2 = O_p\Big(\min_{p' \ge p}\Big\{\frac{p'}{p'-1}\, n^{-\frac{1}{1+s}} M^{1-\frac{2s}{p'(1+s)}} R_{p'}^{\frac{2s}{1+s}}\Big\}\Big), \tag{14}$$

under a strong condition $\kappa_M = 1$ that requires all RKHSs to be completely uncorrelated
with each other. Comparing our bound with their result, there is $\min_{p' \ge p}$ and
$\frac{p'}{p'-1}$ in their bound (if there were not the term $\frac{p'}{p'-1}$, then the
minimum of $\min_{p' \ge p}$ would be attained at $p' = p$, thus our bound is tighter). Due to
this, we obtain a quite different consequence from theirs. According to our bound (11), the
optimal regularization among all $\ell_p$-norms that gives the smallest generalization error
is $\ell_1$-regularization (this will be discussed later in Sec. 4.4), while their consequence
says that the optimal $p$ changes depending on the "sparsity" of the true function $f^*$.
Moreover we will observe that $\ell_1$-regularization is optimal among all isotropic
mixed-norm-type regularizations. The details of the optimality will be discussed in Sec. 4.4.
4.2.2 Convergence Rate of Elasticnet-MKL

Elasticnet-MKL employs a mixture of $\ell_1$ and $\ell_2$ norms as the regularizer:

$$\|f\|_\psi = \tau\|f\|_{\ell_1} + (1-\tau)\|f\|_{\ell_2}$$

where $\tau \in [0, 1]$. Then its dual norm is given by

$$\|b\|_{\psi^*} = \min_{a \in \mathbb{R}^M}\Big\{\max\Big(\frac{\|a\|_{\ell_\infty}}{\tau}, \frac{\|a - b\|_{\ell_2}}{1-\tau}\Big)\Big\}.$$

Therefore by a simple calculation, we have
$\|\mathbf{1}\|_{\psi^*} = \frac{\sqrt{M}}{1-\tau+\tau\sqrt{M}}$. Hence Eq. (8) gives the
convergence rate of elasticnet-MKL as

$$\|\hat{f} - f^*\|_{L_2(\Pi)}^2 = O_p\Big(n^{-\frac{1}{1+s}} \frac{M^{1-\frac{s}{1+s}}}{(1-\tau+\tau\sqrt{M})^{\frac{2s}{1+s}}} \big(\tau\|f^*\|_{\ell_1} + (1-\tau)\|f^*\|_{\ell_2}\big)^{\frac{2s}{1+s}} + \frac{M\log(M)}{n}\Big).$$

Note that, when $\tau = 0$ or $\tau = 1$, this rate is identical to that of $\ell_2$-MKL or
$\ell_1$-MKL obtained in Eq. (9), respectively.

4.2.3 Convergence Rate of VSKL 

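Variable Sparsity Kernel Learning (VSKL) employs a mixed norm regularization defined by

$$\|f\|_\psi = \|f\|_{(p,q)} = \Big\{\sum_{j=1}^{M'}\Big(\sum_{k=1}^{M_j} \|f_{j,k}\|_{\mathcal{H}_{j,k}}^p\Big)^{\frac{q}{p}}\Big\}^{\frac{1}{q}},$$

where the RKHSs are divided into $M'$ groups $\{\mathcal{H}_{j,k}\}_{k=1}^{M_j}$
$(j = 1, \dots, M')$ and $1 \le p$, $1 \le q$.

Lemma 3 The dual of the mixed norm is given by

$$\|b\|_{\psi^*} = \Big\{\sum_{j=1}^{M'}\Big(\sum_{k=1}^{M_j} |b_{j,k}|^{p^*}\Big)^{\frac{q^*}{p^*}}\Big\}^{\frac{1}{q^*}},$$

for $b_{j,k} \in \mathbb{R}$ $(k = 1, \dots, M_j,\ j = 1, \dots, M')$, where $p^*$ and $q^*$ are
the conjugate exponents satisfying $\frac{1}{p} + \frac{1}{p^*} = 1$ and
$\frac{1}{q} + \frac{1}{q^*} = 1$.

The proof will be given in Appendix F.2. Therefore the dual norm of the vector $\mathbf{1}$ is
given by $\|\mathbf{1}\|_{\psi^*} = \big(\sum_{j=1}^{M'} M_j^{\frac{q^*}{p^*}}\big)^{\frac{1}{q^*}}$.
Hence, by Eq. (8), the convergence rate of VSKL is given as

$$\|\hat{f} - f^*\|_{L_2(\Pi)}^2 = O_p\Big(n^{-\frac{1}{1+s}} M^{1-\frac{2s}{1+s}} \Big(\sum_{j=1}^{M'} M_j^{\frac{q^*}{p^*}}\Big)^{\frac{2s}{q^*(1+s)}} \Big\{\sum_{j=1}^{M'}\Big(\sum_{k=1}^{M_j} \|f_{j,k}^*\|_{\mathcal{H}_{j,k}}^p\Big)^{\frac{q}{p}}\Big\}^{\frac{2s}{q(1+s)}} + \frac{M\log(M)}{n}\Big).$$

One can check that this convergence rate coincides with that of $\ell_p$-MKL when $M' = 1$.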
4.3 Minimax Lower Bound

In this section, we show that the derived learning rate (8) achieves the minimax learning
rate on the $\psi$-norm ball

$$\mathcal{H}_\psi(R) := \Big\{f = \sum_{m=1}^M f_m \ \Big|\ \|f\|_\psi \le R\Big\},$$

when the norm is isotropic.

Definition 1 We say that the $\psi$-norm $\|\cdot\|_\psi$ is isotropic when there exists a
universal constant $c$ such that

$$cM = c\|\mathbf{1}\|_{\ell_1} \ge \|\mathbf{1}\|_{\psi^*}\|\mathbf{1}\|_\psi, \qquad \|b\|_\psi \le \|b'\|_\psi \ \ (\text{if } 0 \le b_m \le b'_m\ (\forall m)), \tag{15}$$

(note that the inverse inequality $M \le \|\mathbf{1}\|_{\psi^*}\|\mathbf{1}\|_\psi$ of the first
condition always holds by the definition of the dual norm).

Practically used regularizations usually satisfy the isotropic property. In fact, $\ell_p$-MKL,
elasticnet-MKL and VSKL satisfy the isotropic property with $c = 1$.
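
The first isotropy condition can be verified directly for two of these regularizers. The sketch below is an illustration under stated assumptions: for $\ell_p$ it uses the conjugate exponent $q$ with $\frac{1}{p} + \frac{1}{q} = 1$, and for the elasticnet norm it uses the value $\|\mathbf{1}\|_{\psi^*} = \sqrt{M}/(1-\tau+\tau\sqrt{M})$ quoted for elasticnet-MKL in Sec. 4.2.2. The function names are hypothetical.

```python
import numpy as np

def lp_norm(v, p):
    """l_p norm of a vector, with p = np.inf giving the maximum absolute entry."""
    v = np.abs(np.asarray(v, dtype=float))
    return float(v.max()) if np.isinf(p) else float((v ** p).sum() ** (1.0 / p))

def lp_isotropy_product(M, p):
    """||1||_psi * ||1||_psi* for psi = l_p, where the dual norm is l_q with 1/p + 1/q = 1."""
    if p == 1:
        q = np.inf
    elif np.isinf(p):
        q = 1.0
    else:
        q = p / (p - 1.0)
    ones = np.ones(M)
    return lp_norm(ones, p) * lp_norm(ones, q)

def elasticnet_isotropy_product(M, tau):
    """Same product for psi = tau * l_1 + (1 - tau) * l_2 evaluated at the all-ones vector."""
    primal = tau * M + (1.0 - tau) * np.sqrt(M)
    dual = np.sqrt(M) / (1.0 - tau + tau * np.sqrt(M))
    return primal * dual
```

In both cases the product equals $M$, so the first condition of (15) holds with $c = 1$ and is met with equality.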

We derive the minimax learning rate in a simpler situation. First we assume that each 
RKHS is same as others. That is, the input vector is decomposed into M components like 
x = (x^\ . . . , x( M ^) where {x^ m ^}^ =1 are M i.i.d. copies of a random variable X, and 
n m = {fm I fm{x) = / m (x« . . . , x^) = f m (x^), f m G H} where H is an RKHS shared 
by all U m . Thus / G U® M is decomposed as f(x) = f(x^\. . .,x^) = £^f =1 f m {x {m) ) 
where each f m is a member of the common RKHS %. We denote by k the kernel associated 
with the RKHS H. 

In addition to the condition on the upper bound of the spectrum (Spectral Assumption (A3)), we assume that the spectra of all the RKHSs $\mathcal{H}_m$ have the same polynomial lower bound.

Assumption 6 (Strong Spectral Assumption) There exist $0 < s < 1$ and $0 < c, c'$ such that

(A6) $\quad c'\ell^{-\frac{1}{s}} \le \tilde\mu_\ell \le c\,\ell^{-\frac{1}{s}}, \quad (1 \le \forall \ell),$

where $\{\tilde\mu_\ell\}_{\ell=1}^\infty$ is the spectrum of the integral operator $T_{\tilde k}$ corresponding to the kernel $\tilde k$. In particular, the spectrum of $T_{k_m}$ also satisfies $\mu_{\ell,m} \sim \ell^{-\frac{1}{s}}$ $(\forall \ell, m)$.

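Without loss of generality, we may assume that $E[f(\tilde X)] = 0$ $(\forall f \in \tilde{\mathcal{H}})$. Since each $f_m$ receives an i.i.d. copy of $\tilde X$, the $\mathcal{H}_m$s are orthogonal to each other:

$$E[f_m(X) f_{m'}(X)] = E[\tilde f_m(X^{(m)})\,\tilde f_{m'}(X^{(m')})] = 0 \quad (\forall f_m \in \mathcal{H}_m,\ \forall f_{m'} \in \mathcal{H}_{m'},\ 1 \le \forall m \ne m' \le M).$$

We also assume that the noise $\{\epsilon_i\}_{i=1}^n$ is an i.i.d. normal sequence with standard deviation $\sigma > 0$.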

Under the assumptions described above, we have the following minimax $L_2(\Pi)$-error.

Theorem 4 Suppose $R > 0$ is given and $n > \frac{c^2 M^2}{R^2\|\mathbf{1}\|^2_{\psi^*}}$ is satisfied. Then the minimax learning rate on $\mathcal{H}_\psi(R)$ for an isotropic norm $\|\cdot\|_\psi$ is lower bounded as

$$\min_{\hat f}\ \max_{f^* \in \mathcal{H}_\psi(R)} E\big[\|\hat f - f^*\|^2_{L_2(\Pi)}\big] \ge C M^{1-\frac{2s}{1+s}}\, n^{-\frac{1}{1+s}} (\|\mathbf{1}\|_{\psi^*} R)^{\frac{2s}{1+s}}, \qquad (16)$$

where the minimum is taken over all measurable functions of the $n$ samples $\{(x_i, y_i)\}_{i=1}^n$.

The proof will be given in Appendix E. One can see that the convergence rate derived in Eq. (8) achieves the minimax rate on the $\psi$-norm ball (Theorem 4) up to the term $\frac{M\log(M)}{n}$, which is negligible when the number of samples is large. Indeed if

$$n \ge \frac{M^2\log(M)^{\frac{1+s}{s}}}{\|\mathbf{1}\|^2_{\psi^*}\|f^*\|^2_\psi}, \qquad (17)$$

then the first term in Eq. (8) dominates the second term $\frac{M\log(M)}{n}$ and the upper bound coincides with the minimax optimal rate. Note that the condition (17) for the sample size $n$ is equivalent to the condition for $n$ assumed in Theorem 4 up to factors of $\log(M)^{\frac{1+s}{s}}$ and a constant.

The fact that $\psi$-norm MKL achieves the minimax optimal rate (16) indicates that the $\psi$-norm regularization is well suited to make the estimator included in the $\psi$-norm ball.
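
The relation between the two terms of the upper bound and the sample-size condition (17) can be checked numerically. The sketch below assumes that Eq. (8) has the form $M^{1-\frac{2s}{1+s}} n^{-\frac{1}{1+s}}(\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi)^{\frac{2s}{1+s}} + \frac{M\log(M)}{n}$ with all constants set to one; the function names are hypothetical.

```python
import math


def leading_term(M, n, s, dual_one, f_norm):
    """First term of the bound: M^{1-2s/(1+s)} n^{-1/(1+s)} (||1||_{psi*} ||f*||_psi)^{2s/(1+s)}."""
    e = 2.0 * s / (1.0 + s)
    return M ** (1.0 - e) * n ** (-1.0 / (1.0 + s)) * (dual_one * f_norm) ** e


def second_term(M, n):
    """Second term of the bound: M log(M) / n."""
    return M * math.log(M) / n


def sample_threshold(M, s, dual_one, f_norm):
    """Right-hand side of condition (17): M^2 log(M)^{(1+s)/s} / (||1||_{psi*} ||f*||_psi)^2."""
    return M ** 2 * math.log(M) ** ((1.0 + s) / s) / (dual_one * f_norm) ** 2


if __name__ == "__main__":
    M, s = 20, 0.5
    dual_one, f_norm = math.sqrt(M), 3.0  # l_2 regularization: ||1||_{l_2} = sqrt(M)
    n0 = sample_threshold(M, s, dual_one, f_norm)
    for n in (n0, 2 * n0, 10 * n0):
        print(n, leading_term(M, n, s, dual_one, f_norm), second_term(M, n))
```

At the threshold the two terms coincide, and for larger $n$ the first term dominates, which is the content of the remark following (17).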

4.4 Optimal Regularization Strategy 

Here we discuss which regularization gives the best performance based on the generalization error bound given by Lemma 2. Surprisingly, the regularization that gives the optimal performance among all isotropic $\psi$-norm regularizations is $\ell_1$-norm regularization. This can be seen as follows. According to Eq. (8), the convergence rate of $\psi$-norm MKL is upper bounded as

$$\|\hat f - f^*\|^2_{L_2(\Pi)} = O_p\Big( M^{1-\frac{2s}{1+s}}\, n^{-\frac{1}{1+s}} (\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi)^{\frac{2s}{1+s}} + \frac{M\log(M)}{n}\Big),$$

and this is minimax optimal on the $\psi$-norm ball if the $\psi$-norm is isotropic. Here by the definition of the dual norm $\|\cdot\|_{\psi^*}$, we always have

$$\|f^*\|_{\ell_1} = \sum_{m=1}^M \|f^*_m\|_{\mathcal{H}_m} = \sum_{m=1}^M 1 \times \|f^*_m\|_{\mathcal{H}_m} \le \|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi.$$

Therefore the leading term of the convergence rate for $\ell_1$-norm regularization is upper bounded by that for any other $\psi$-norm regularization:

$$M^{1-\frac{2s}{1+s}}\, n^{-\frac{1}{1+s}} (\|\mathbf{1}\|_{\ell_\infty}\|f^*\|_{\ell_1})^{\frac{2s}{1+s}} \le M^{1-\frac{2s}{1+s}}\, n^{-\frac{1}{1+s}} (\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi)^{\frac{2s}{1+s}}$$

(here it should be noticed that the dual norm of the $\ell_1$-norm is the $\ell_\infty$-norm and $\|\mathbf{1}\|_{\ell_\infty} = 1$). This shows that the upper bound (8) is minimized by $\ell_1$-norm regularization. In other words, $\ell_1$-regularization is optimal among all (isotropic) $\psi$-norm regularizations in homogeneous settings.
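
To illustrate the argument, the following sketch evaluates the leading term $M^{1-\frac{2s}{1+s}} n^{-\frac{1}{1+s}}(\|\mathbf{1}\|_{\ell_{p^*}}\|f^*\|_{\ell_p})^{\frac{2s}{1+s}}$ for several values of $p$ in the homogeneous setting. The vector of RKHS norms and the function names are made up for illustration, and constants are ignored.

```python
import math


def lp_norm(b, p):
    """l_p norm of a sequence b; p may be math.inf."""
    if math.isinf(p):
        return max(abs(x) for x in b)
    return sum(abs(x) ** p for x in b) ** (1.0 / p)


def dual_exponent(p):
    """Conjugate exponent p* with 1/p + 1/p* = 1."""
    if p == 1:
        return math.inf
    if math.isinf(p):
        return 1.0
    return p / (p - 1.0)


def rate_leading_term(rkhs_norms, p, n, s):
    """Leading term of the homogeneous bound for l_p-MKL (constants dropped)."""
    M = len(rkhs_norms)
    dual_one = lp_norm([1.0] * M, dual_exponent(p))  # ||1||_{psi*} = M^{1 - 1/p}
    e = 2.0 * s / (1.0 + s)
    return M ** (1.0 - e) * n ** (-1.0 / (1.0 + s)) * (dual_one * lp_norm(rkhs_norms, p)) ** e


if __name__ == "__main__":
    norms = [0.9, 0.1, 0.5, 0.3]  # hypothetical values of ||f*_m||_{H_m}
    for p in (1.0, 1.5, 2.0, 4.0, math.inf):
        print(p, rate_leading_term(norms, p, n=200, s=0.5))
```

Because $\|f^*\|_{\ell_1} \le \|\mathbf{1}\|_{\ell_{p^*}}\|f^*\|_{\ell_p}$ by Hölder's inequality, the value at $p = 1$ is never larger than the others, whatever norms are supplied.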

This consequence is different from that of Kloft and Blanchard (2011), where the optimal regularization among $\ell_p$-MKL is discussed. Their consequence says that the best performance is achieved at $p > 1$ and the best $p$ depends on the variation of the RKHS norms of $\{f^*_m\}_{m=1}^M$: if $f^*$ is close to sparse (i.e., $\|f^*_m\|_{\mathcal{H}_m}$ decays rapidly), small $p$ is preferred; on the other hand, if $f^*$ is dense (i.e., $\{\|f^*_m\|_{\mathcal{H}_m}\}_{m=1}^M$ is uniform), then large $p$ is preferred. This consequence seems reasonable, but our consequence is different: $\ell_1$-norm regularization is always optimal among $\ell_p$-regularizations. The discrepancy between the two consequences comes from additional terms in their bound (14), including the $\min_{p' \ge p}$ factor (there are no such terms in our bound). This difference makes our bound tighter than theirs but simultaneously leads to a somewhat counter-intuitive consequence that contrasts with some experimental results supporting dense type regularization. However, such experimental observations are justified by considering inhomogeneous settings. Here we should notice that the homogeneous setting is quite restrictive and unrealistic because it requires that the complexities of all RKHSs be uniformly the same. In real settings, it is natural to assume that the complexities vary depending on the RKHS (inhomogeneous). In the next section, we discuss how dense type regularizations outperform $\ell_1$-regularization.

5. Analysis on Inhomogeneous Settings 

In the previous sections (analysis on homogeneous settings), we have seen that $\ell_1$-MKL shows the best performance among isotropic $\psi$-norms, and have not observed any theoretical justification supporting the fact that dense MKL methods such as $\ell_p$-MKL with $p > 1$ can outperform the sparse $\ell_1$-MKL (Cortes et al., 2010). In this section, we show that dense type regularizations can outperform the sparse regularization in inhomogeneous settings (where there exist $m, m'$ such that $s_m \ne s_{m'}$). For simplicity, we focus on $\ell_p$-MKL, and discuss the relation between the learning rate and the norm parameter $p$.

Let us consider an extreme situation where $s_1 = s$ for some $0 < s < 1$ and $s_m = 0$ $(m > 1)$.$^3$ In this situation, we have

$$\alpha_1 = 3\Big(\frac{r_1^{-2s} + M - 1}{n}\Big)^{\frac12},\quad \alpha_2 = 3\,\frac{s\, r_1^{1-s}}{\sqrt{n}},\quad \beta_1 = 3\Big(\frac{r_1^{-\frac{2s(3-s)}{1+s}}}{n^{\frac{2}{1+s}}} + \frac{M-1}{n^2}\Big)^{\frac12},\quad \beta_2 = 3\,\frac{s\, r_1^{\frac{(1-s)^2}{1+s}}}{n^{\frac{1}{1+s}}},$$

for all $p$. Note that these $\alpha_1$, $\alpha_2$, $\beta_1$ and $\beta_2$ have no dependency on $p$. Therefore the learning bound (7) is smallest when $p = \infty$ because $\|f^*\|_{\ell_\infty} \le \|f^*\|_{\ell_p}$ for all $1 \le p < \infty$. In particular, when $(\|f^*_m\|_{\mathcal{H}_m})_{m=1}^M = \mathbf{1}$, we have $\|f^*\|_{\ell_\infty} = \frac{1}{M}\|f^*\|_{\ell_1}$ and thus obviously the learning rate of $\ell_\infty$-MKL given by Eq. (7) is faster than that of $\ell_1$-MKL. In fact, through a somewhat cumbersome calculation, one can check that $\ell_\infty$-MKL can be at least $M^{\frac{2s}{1+s}}$ times faster (up to constants) than $\ell_1$-MKL in the worst case. Indeed we have the following learning rates of $\ell_1$-MKL and $\ell_\infty$-MKL (say $\hat f^{(1)}$ and $\hat f^{(\infty)}$).



3. In our assumption $s_m$ should be greater than 0. However, we formally put $s_m = 0$ $(m > 1)$ for simplicity of discussion. For a rigorous discussion, one might consider arbitrarily small $s_m \ll s$.

Figure 1: The generalization error bound (19) of $\ell_p$-MKL with respect to $p$.



Lemma 5 Suppose si = s for < s < 1 and s m = (m > 1) and \\fm\\u m 
n>M^V (Mlog(M)) 1 ^, then the bound (J7J) implies 



1 (Vm). 7/ 



(oo) 



/ 



*\\2 



\L 2 (n) 



f* ii 2 

J llLa(n) 



This indicates that when the complexities of RKHSs are inhomogeneous, the generalization ability of dense type regularization (e.g., $\ell_\infty$-MKL) can be better than that of sparse type regularization ($\ell_1$-MKL).

Next we numerically calculate the convergence rate: 



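$$\min_{\{r_m\}_{m=1}^M:\ r_m > 0}\ \Big\{\alpha_1^2 + \beta_1^2 + \Big[\Big(\frac{\alpha_2}{\alpha_1}\Big)^2 + \Big(\frac{\beta_2}{\beta_1}\Big)^2\Big]\|f^*\|^2_\psi + \frac{M\log(M)}{n}\Big\}. \qquad (19)$$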



Here we randomly generated $s_m$ from the uniform distribution on $[0, 1/3]$ and $\|f^*_m\|_{\mathcal{H}_m}$ from the uniform distribution on $[0, 1]$, with $n = 100$ and $M = 10$. We then calculated the minimum of Eq. (19) using a numerical optimization solver, where the $\ell_p$-norm is employed as the regularizer ($\ell_p$-MKL). We used the Differential Evolution technique$^4$ (Price et al., 2005; Chakraborty, 2008) to obtain the minimum value. Figure 1 plots the minimum value of Eq. (19) against the parameter $p$ of the $\ell_p$-norm. We can see that the generalization error first goes down and then goes up as $p$ gets large. The optimal $p$ is attained around $p = 1.4$ in this example.
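
A minimal sketch of such a numerical evaluation is given below. It is not the Matlab code used for Figure 1: it replaces Differential Evolution with a plain random search over $\log r_m$, and it assumes that the objective of Eq. (19) is $\alpha_1^2 + \beta_1^2 + [(\alpha_2/\alpha_1)^2 + (\beta_2/\beta_1)^2]\|f^*\|^2_{\ell_p} + M\log(M)/n$, with $\alpha_1, \alpha_2, \beta_1, \beta_2$ computed from their definitions for the $\ell_p$-norm (dual norm $\ell_{p^*}$). All function names are hypothetical.

```python
import math
import random


def lp_norm(v, p):
    """l_p norm of a sequence v; p may be math.inf."""
    if math.isinf(p):
        return max(abs(x) for x in v)
    return sum(abs(x) ** p for x in v) ** (1.0 / p)


def dual_exponent(p):
    """Conjugate exponent p* with 1/p + 1/p* = 1."""
    if p == 1:
        return math.inf
    if math.isinf(p):
        return 1.0
    return p / (p - 1.0)


def coefficients(r, s, n, p):
    """alpha_1, alpha_2, beta_1, beta_2 for l_p-MKL, given r_m > 0 and s_m."""
    q = dual_exponent(p)  # psi* is the l_{p*} norm
    pairs = list(zip(r, s))
    a1 = 3.0 * math.sqrt(sum(rm ** (-2 * sm) for rm, sm in pairs) / n)
    a2 = 3.0 * lp_norm([sm * rm ** (1 - sm) / math.sqrt(n) for rm, sm in pairs], q)
    b1 = 3.0 * math.sqrt(sum(rm ** (-2 * sm * (3 - sm) / (1 + sm)) / n ** (2 / (1 + sm))
                             for rm, sm in pairs))
    b2 = 3.0 * lp_norm([sm * rm ** ((1 - sm) ** 2 / (1 + sm)) / n ** (1 / (1 + sm))
                        for rm, sm in pairs], q)
    return a1, a2, b1, b2


def bound(r, s, f_norms, n, p):
    """Assumed objective of Eq. (19), constants dropped."""
    M = len(s)
    a1, a2, b1, b2 = coefficients(r, s, n, p)
    lam = (a2 / a1) ** 2 + (b2 / b1) ** 2
    return a1 ** 2 + b1 ** 2 + lam * lp_norm(f_norms, p) ** 2 + M * math.log(M) / n


def minimize_bound(s, f_norms, n, p, trials=2000, seed=0):
    """Random search over log r_m in [-5, 5]; a stand-in for Differential Evolution."""
    rng = random.Random(seed)
    best_r = [1.0] * len(s)
    best = bound(best_r, s, f_norms, n, p)
    for _ in range(trials):
        r = [math.exp(rng.uniform(-5.0, 5.0)) for _ in s]
        val = bound(r, s, f_norms, n, p)
        if val < best:
            best, best_r = val, r
    return best, best_r


if __name__ == "__main__":
    rng = random.Random(1)
    s = [rng.uniform(0.0, 1.0 / 3.0) for _ in range(10)]
    f_norms = [rng.uniform(0.0, 1.0) for _ in range(10)]
    for p in (1.0, 1.2, 1.4, 2.0, 4.0):
        print(p, minimize_bound(s, f_norms, n=100, p=p)[0])
```

A random search is crude compared with Differential Evolution, so the printed values only indicate the shape of the curve; in the extreme setting $s_1 = s$, $s_m = 0$ $(m > 1)$ the routine `coefficients` returns the same values for every $p$, in line with the discussion above.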

In real settings, it is likely that one uses various types of kernels and the complexities of RKHSs become inhomogeneous. As mentioned above, it has often been reported that $\ell_1$-MKL is outperformed by dense type MKL such as $\ell_p$-MKL with $p > 1$ in numerical experiments (Cortes et al., 2010). Our theoretical analysis in this section supports these experimental results well.

4. We used the Matlab code available in Chakraborty (2008).

6. Numerical Comparison between Homogeneous and Inhomogeneous 
Settings 

Here we investigate numerically how the inhomogeneity of the complexities affects the performance using synthetic data. In particular, we numerically compare two situations: (a) all complexities of RKHSs are the same (homogeneous situation) and (b) one RKHS is complex and the other RKHSs are evenly simple (inhomogeneous situation).

The experimental settings are as follows. The input random variable is a 20-dimensional vector $x = (x^{(1)},\dots,x^{(20)})$ where each element $x^{(m)}$ is independently and identically distributed from the uniform distribution on $[0, 1]$:

$$x^{(m)} \sim \mathrm{Unif}([0, 1]) \quad (m = 1,\dots,20).$$

For each coordinate $m = 1,\dots,20$, we put one Gaussian RKHS $\mathcal{H}_m$ with a Gaussian width $\sigma_m$: the number of kernels is 20 ($M = 20$) and

$$k_m(x, x') = \exp\Big(-\frac{(x^{(m)} - x'^{(m)})^2}{2\sigma_m^2}\Big) \quad (m = 1,\dots,20),$$

for $x = (x^{(1)},\dots,x^{(20)})$ and $x' = (x'^{(1)},\dots,x'^{(20)})$. To generate the ground truth $f^*$, we randomly generated 5 center points $\mu_{i,m}$ $(i = 1,\dots,5)$ for each coordinate $m = 1,\dots,20$, where $\mu_{i,m}$ is independently generated from the uniform distribution on $[0, 1]$. Then we obtain the following form of the true function:

$$f^*(x) = \sum_{m=1}^{20} f^*_m(x), \quad \text{where } f^*_m(x) = \sum_{i=1}^{5} \alpha_{i,m}\exp\Big(-\frac{(\mu_{i,m} - x^{(m)})^2}{2\sigma_m^2}\Big) \in \mathcal{H}_m,$$

for $x = (x^{(1)},\dots,x^{(20)})$. Each coefficient $\alpha_{i,m}$ is independently and identically distributed from the standard normal distribution. The output $y$ is contaminated by a noise $\epsilon$, where $\epsilon$ follows the Gaussian distribution with mean 0 and standard deviation 0.1:

$$y = f^*(x) + \epsilon, \qquad \epsilon \sim N(0, 0.1^2).$$
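
The data-generating process described above can be sketched as follows. This is only an illustration of the setup, not the code used for the experiments: the function names, the seeds, and the use of Python's standard `random` module are our own choices, and the widths correspond to the homogeneous and inhomogeneous settings listed later in this section.

```python
import math
import random


def make_truth(sigmas, n_centers=5, seed=0):
    """Draw centers mu_{i,m} ~ Unif[0,1] and coefficients alpha_{i,m} ~ N(0,1); return f*."""
    rng = random.Random(seed)
    centers = [[rng.uniform(0.0, 1.0) for _ in range(n_centers)] for _ in sigmas]
    coefs = [[rng.gauss(0.0, 1.0) for _ in range(n_centers)] for _ in sigmas]

    def f_star(x):
        total = 0.0
        for m, sigma in enumerate(sigmas):
            for mu, a in zip(centers[m], coefs[m]):
                total += a * math.exp(-((mu - x[m]) ** 2) / (2.0 * sigma ** 2))
        return total

    return f_star


def sample(f_star, n, dim, noise_sd=0.1, seed=1):
    """Draw n inputs uniformly on [0,1]^dim and outputs y = f*(x) + Gaussian noise."""
    rng = random.Random(seed)
    xs = [[rng.uniform(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    ys = [f_star(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


HOMOGENEOUS = [0.5] * 20                 # sigma_m = 0.5 for all m
INHOMOGENEOUS = [0.01] + [0.5] * 19      # sigma_1 = 0.01, the rest 0.5

if __name__ == "__main__":
    f_star = make_truth(INHOMOGENEOUS)
    xs, ys = sample(f_star, n=200, dim=20)
    print(len(xs), len(xs[0]), ys[:3])
```

Fitting $\ell_p$-MKL to such samples requires a separate solver, which is outside the scope of this sketch.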

We generated 200 or 400 realizations $\{(x_i, y_i)\}_{i=1}^n$ ($n = 200$ or $n = 400$), and estimated $f^*$ using $\ell_p$-MKL with $p = 1, 1.1, 1.2, \dots, 3$.$^5$ The estimator is computed with various regularization parameters $\lambda_1^{(n)}$. The generalization error $\|\hat f - f^*\|^2_{L_2(\Pi)}$ was numerically calculated. We repeated the experiments 100 times, averaged the generalization errors over the 100 repetitions for each $p$ and each regularization parameter, and obtained the optimal average generalization error among all regularization parameters for each $p$. The true function was randomly generated for each repetition. We investigated the generalization errors in the following homogeneous and inhomogeneous settings:

5. We included a bias term in this experiment, that is, we fitted $f(x) + b$ to the data: $\min_{f_m, b} \frac{1}{n}\sum_{i=1}^n \big(y_i - \sum_{m=1}^M f_m(x_i) - b\big)^2 + \lambda_1^{(n)}\|f\|^2_{\ell_p}$.

1. (homogeneous) $\sigma_m = 0.5$ for $m = 1,\dots,20$.

2. (inhomogeneous) $\sigma_1 = 0.01$ and $\sigma_m = 0.5$ for $m = 2,\dots,20$.

The difference between the above homogeneous and inhomogeneous settings is the value of $\sigma_1$: whether $\sigma_1 = 0.5$ or $\sigma_1 = 0.01$. The inhomogeneous situation is analogous to that investigated in Sec. 5, where we assumed one RKHS is complex and the other RKHSs are evenly simple (small $\sigma_1$ corresponds to a complex RKHS).

Figure 2 shows the average generalization errors in the homogeneous setting with (a) $n = 200$ and (c) $n = 400$, and the inhomogeneous setting with (b) $n = 200$ and (d) $n = 400$. Each broken line corresponds to one regularization parameter. The bold solid line shows the best (average) generalization error among all the regularization parameters. We can see that in the homogeneous setting $\ell_1$-regularization shows the best performance, whereas in the inhomogeneous setting the best performance is achieved at $p > 1$ for both $n = 200$ and $n = 400$. These experimental results match the theoretical investigations well.



7. Generalization of loss function 

Here we discuss how a general loss function other than the squared loss can be incorporated into our analysis. As in the standard local Rademacher complexity argument (Bartlett et al., 2005), we consider a class of loss functions that are Lipschitz continuous and strongly convex. Suppose that the loss function $\Psi : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ satisfies Lipschitz continuity: for all $R > 0$, there exists a constant $T(R)$ such that

$$|\Psi(y, f_1) - \Psi(y, f_2)| \le T(R)|f_1 - f_2| \quad (\forall f_1, f_2 \in \mathbb{R} \text{ such that } |f_1|, |f_2| \le R,\ \forall y \in \mathbb{R}). \qquad (20)$$

Moreover, suppose that, for all $y \in \mathbb{R}$, $\Psi(y, f)$ is strongly convex with a modulus $\rho(R) > 0$:

$$\frac{\Psi(y, f_1) + \Psi(y, f_2)}{2} \ge \Psi\Big(y, \frac{f_1 + f_2}{2}\Big) + \frac{\rho(R)}{2}|f_1 - f_2|^2 \quad (\forall f_1, f_2 \in \mathbb{R} \text{ such that } |f_1|, |f_2| \le R). \qquad (21)$$

Some detailed discussions about these conditions and examples can be found in Bartlett et al. (2006). For loss functions satisfying these properties, we obtain a simplified bound where some conditions can be omitted as follows:

• We can remove the condition imposed on $\frac{4\phi\sqrt{n}}{\kappa_M}\max\{\alpha_1^2, \beta_1^2, \frac{M\log(M)}{n}\}\eta(t')$.

• The term $\exp(-t')$ is not needed in the tail probability.

Figure 2: The expected generalization error $E[\|\hat f - f^*\|^2_{L_2(\Pi)}]$ against the parameter $p$ for $\ell_p$-MKL. Panels: (a) Homogeneous Setting ($n = 200$), (b) Inhomogeneous Setting ($n = 200$), (c) Homogeneous Setting ($n = 400$), (d) Inhomogeneous Setting ($n = 400$). Each broken line corresponds to one regularization parameter. The bold solid line shows the best generalization error among all the regularization parameters.

To obtain a fast convergence rate for general loss functions, we move the regularization term in Eq. (1) into a constraint, and then consider the following optimization problem:



$$\hat f = \sum_{m=1}^M \hat f_m = \mathop{\arg\min}_{\substack{f_m \in \mathcal{H}_m\ (m=1,\dots,M),\\ \|f\|_\psi \le R}}\ \frac{1}{n}\sum_{i=1}^n \Psi\Big(y_i, \sum_{m=1}^M f_m(x_i)\Big), \qquad (22)$$

where $R$ is a regularization parameter. The above optimization problem is essentially equivalent to the original formulation (1), but by considering the constraint type regularization instead of the penalty type regularization, the theoretical analysis of statistical performance can be simplified.

We define $Pg$ as the expectation of a function $g : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$:

$$Pg := E_{(X,Y) \sim P}[g(X, Y)].$$

For notational simplicity, we write $P\Psi(f) = P\Psi(Y, f) = E_{(X,Y) \sim P}[\Psi(Y, f(X))]$ for a function $f$. We suppose there exists a minimizer of $P\Psi(f)$ as follows.

Assumption 7 (Minimizer Existence Assumption) There exists a unique $f^* = (f^*_1,\dots,f^*_M) \in \mathcal{H}^{\oplus M}$ such that

(A7) $\quad \displaystyle\sum_{m=1}^M f^*_m = \mathop{\arg\min}_{f_m \in \mathcal{H}_m\ (m=1,\dots,M)} P\Psi\Big(\sum_{m=1}^M f_m(X)\Big).$

Note that, due to the incoherence assumption (Assumption 4) and the strong convexity (21) of the loss function, if there exists a minimizer, then it is automatically unique.
To bound the convergence rate for a general loss function, it is convenient to utilize the local Rademacher complexity on the $\psi$-norm ball. Let $\mathcal{H}^{(r)}_\psi(R) := \{f \in \mathcal{H}^{\oplus M} \mid \|f\|_{L_2(\Pi)} \le r,\ \|f\|_\psi \le R\}$. Then the local Rademacher complexity of $\mathcal{H}^{(r)}_\psi(R)$ is defined as

$$R_n(\mathcal{H}^{(r)}_\psi(R)) := E_{\{\sigma_i, x_i\}_{i=1}^n}\Big[\sup_{f \in \mathcal{H}^{(r)}_\psi(R)} \frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\Big],$$

where $\sigma_i \in \{\pm 1\}$ is an i.i.d. Rademacher random variable with $P(\sigma_i = 1) = P(\sigma_i = -1) = \frac12$. Evaluating the local Rademacher complexity is a key ingredient to show a fast convergence rate for a general loss function. We obtain the following estimate of the local Rademacher complexity (the proof will be given in Appendix F.4).
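
To make the definition concrete, the following sketch estimates a local Rademacher complexity by Monte Carlo for a finite function class represented by its values on a fixed sample. This is a simplification under stated assumptions: the expectation is taken over the signs $\sigma_i$ only (the sample is held fixed), the localization uses the empirical $L_2$ norm as a stand-in for $\|\cdot\|_{L_2(\Pi)}$, and the $\psi$-norm constraint is omitted. The function names are hypothetical.

```python
import math
import random


def empirical_l2(values):
    """Empirical L2 norm of a function given by its values on the sample."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def local_rademacher(function_values, r, draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_f (1/n) sum_i sigma_i f(x_i)
    over the functions whose empirical L2 norm is at most r."""
    local = [f for f in function_values if empirical_l2(f) <= r]
    if not local:
        return 0.0  # convention used here for an empty localized class
    n = len(local[0])
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += max(sum(s * v for s, v in zip(sigma, f)) / n for f in local)
    return total / draws


if __name__ == "__main__":
    n = 25
    cls = [[1.0] * n, [-1.0] * n]  # the class {f, -f} with f = 1
    print(local_rademacher(cls, r=1.0))   # roughly sqrt(2 / (pi * n))
    print(local_rademacher(cls, r=0.5))   # localized class is empty
```

Shrinking the radius $r$ removes functions from the supremum, which is the mechanism by which localization yields faster rates than a global complexity bound.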



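Lemma 6 Let $\{r_m\}_{m=1}^M$ be arbitrary positive reals. Under Assumptions 2-5, there exists a constant $\tilde\phi$ depending on $\{s_m\}_{m=1}^M$, $c$, $C_1$ such that for all $n$ satisfying $\frac{\log(M)}{\sqrt{n}} \le 1$ we have

$$R_n(\mathcal{H}^{(r)}_\psi(R)) \le \tilde\phi\Big(\alpha_1\frac{r}{\sqrt{\kappa_M}} + \alpha_2 R + \beta_1\frac{r}{\sqrt{\kappa_M}} + \beta_2 R + \sqrt{\frac{M\log(M)}{n}}\,\frac{r}{\sqrt{\kappa_M}}\Big).$$

Finally note that the supremum norm of $f$ with $\|f\|_\psi \le R$ can be bounded as

$$\|f\|_\infty \le \sum_{m=1}^M \|f_m\|_\infty \le \sum_{m=1}^M \|f_m\|_{\mathcal{H}_m} \le \|\mathbf{1}\|_{\psi^*}\|f\|_\psi \le \|\mathbf{1}\|_{\psi^*}R.$$

Then, we obtain the excess risk bound as in the following theorem.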

Theorem 7 Suppose Assumptions 2-5 and 7 are satisfied and the loss function $\Psi$ satisfies the conditions (20) and (21). Let $\{r_m\}_{m=1}^M$ be arbitrary positive reals that can depend on $n$, and let $T = T(\|\mathbf{1}\|_{\psi^*}R)$ and $\rho = \rho(\|\mathbf{1}\|_{\psi^*}R)$. Set $R = \|f^*\|_\psi$. Then there exists a constant $\phi'$ depending on $\{s_m\}_{m=1}^M$, $c$, $C_1$ such that for all $n$ satisfying $\frac{\log(M)}{\sqrt{n}} \le 1$, we have



P(*(/)-*(/t)) 

/ 2 „„ Mlog(M)\ ~.f 2 

K M V n J P 



—Y+ ( ^1 1 

Oil) \Pl 



n 

with probability 1 — exp(— t). 



+ {22f\\l\\ r R + 27p}t ^ 



This can be shown by applying the bound of the local Rademacher complexity (Lemma 6) to Corollary 5.3 of Bartlett et al. (2005).$^6$ Compared with the bound in Eq. (6), we notice that there is no $\exp(-t')$ term in the tail probability bound, and thus we do not need the condition on $\max\{\alpha_1^2, \beta_1^2, \frac{M\log(M)}{n}\}\eta(t')$. Because of this, the range of $n$ where the error bound holds is relaxed compared with that in Theorem 1. These simplifications are due to the Lipschitz continuity of the loss function. In Theorem 1, we had to bound the discrepancy between the empirical and population means of the squared loss: $\frac{1}{n}\sum_{i=1}^n (f(x_i) - f^*(x_i))^2 - P(f - f^*)^2$. Since the squared loss is not Lipschitz continuous, we required an additional bound for that discrepancy using Assumption 5 for the supremum norm, and it was shown that the discrepancy is negligible at the cost of $\exp(-t')$ in the tail probability. On the other hand, for Lipschitz continuous losses, we no longer need to bound such a quantity. Thus the tail probability loss $\exp(-t')$ is not incurred.

Since the bound (23) is basically the same as Eq. (6), we obtain the same discussions as in the previous sections. For example, in the homogeneous setting, we obtain the following convergence bound.

Lemma 8 When s m = s (Vm) with some < s < 1, if we set R = \\f*\\^p, then for all n 
satisfying lo ^ f ) < l and n > (||1||^» ||/*||^/M) ^ , and for all t > 1, we have 

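$$P(\Psi(\hat f) - \Psi(f^*)) \le C\Big\{ M^{1-\frac{2s}{1+s}}\, n^{-\frac{1}{1+s}} (\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi)^{\frac{2s}{1+s}} + \frac{M\log(M)}{n} + \frac{t}{n}\Big\},$$

with probability $1 - \exp(-t)$, where $C$ is a constant depending on $\phi'$, $\kappa_M$, $\rho(\|\mathbf{1}\|_{\psi^*}R)$, and $T(\|\mathbf{1}\|_{\psi^*}R)$.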



6. In Corollary 5.3 of Bartlett et al. (2005), the range of the function class is assumed to be included in the interval $[-1, 1]$. Here we use a more general setting where the interval is $[-a, a]$ and $\|\mathbf{1}\|_{\psi^*}R$ is substituted for $a$. See Lemma 9 of Kloft and Blanchard (2011).

8. Conclusion and Future Work

We have shown a unifying framework to derive the learning rate of MKL with arbitrary mixed-norm-type regularization. To analyze the general result, we considered two situations: homogeneous settings and inhomogeneous settings. We have seen that the convergence rate of $\ell_p$-MKL obtained in homogeneous settings is tighter and requires less restrictive conditions than existing results. We have also shown convergence rates of some examples (elasticnet-MKL and VSKL), and proved that the derived learning rate is minimax optimal when the $\psi$-norm is isotropic. An interesting consequence was that $\ell_1$-regularization is optimal among all isotropic $\psi$-norm regularizations in homogeneous settings. In the analysis of inhomogeneous settings, we have shown that dense type regularization can outperform sparse $\ell_1$-regularization, using analytically obtained bounds and numerically computed bounds. We observed that our bound explains well the experimental results favorable for dense type MKL. Finally we numerically investigated the generalization errors of $\ell_p$-MKL in a homogeneous setting and an inhomogeneous setting. The numerical experiments supported the theoretical findings that $\ell_1$-regularization is optimal in homogeneous settings whereas dense type regularizations are preferred in inhomogeneous settings. This is the first result suggesting that the inhomogeneity of the complexities of RKHSs justifies the favorable performance of dense type MKL.

An interesting future work concerns the $\frac{M\log(M)}{n}$ term appearing in the bound Eq. (8).
Because of this term, our bound is 0(M log(M)) with respect to M while in the existing 
work that is 0( v / log(M) V Af *) for £ p -MKL. Therefore our bound is not tight in the 
global bound regime (n < MpR~ 2 log (M)^~ for ^ p -MKL). It is an interesting issue to 
clarify whether the term $\frac{M\log(M)}{n}$ can be replaced by other tighter bounds or not. To do so, it might be helpful to combine the technique developed in this paper with that developed by Kloft and Blanchard (2011), where the local Rademacher complexity for $\ell_p$-MKL is derived.

Appendix A. Relation between Entropy Number and Spectral Condition

Associated with the $\epsilon$-covering number, the $i$-th entropy number $e_i(\mathcal{H}_m \to L_2(\Pi))$ is defined as the infimum over all $\epsilon > 0$ for which $N(\epsilon, \mathcal{B}_{\mathcal{H}_m}, L_2(\Pi)) \le 2^{i-1}$. If the spectral assumption (A3) and the boundedness assumption (A2) hold, the relation (2) implies that the $i$-th entropy number is bounded as

$$e_i(\mathcal{H}_m \to L_2(\Pi)) \le C i^{-\frac{1}{2s_m}}, \qquad (24)$$

where $C$ is a constant. To bound the empirical process, a bound on the entropy number with respect to the empirical distribution is needed. The following proposition gives such an upper bound (see Corollary 7.31 of Steinwart (2008), for example).

Proposition 9 If there exist constants $0 < s < 1$ and $C \ge 1$ such that $e_i(\mathcal{H}_m \to L_2(\Pi)) \le C i^{-\frac{1}{2s}}$, then there exists a constant $c_s > 0$ only depending on $s$ such that

$$E_{D_n \sim \Pi^n}[e_i(\mathcal{H}_m \to L_2(D_n))] \le c_s C (\min(i, n))^{\frac{1}{2s}}\, i^{-\frac{1}{s}},$$

in particular $E_{D_n \sim \Pi^n}[e_i(\mathcal{H}_m \to L_2(D_n))] \le c_s C i^{-\frac{1}{2s}}$.



Appendix B. Basic Propositions 

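The following two propositions are keys to proving Theorem 1. Let $\{\sigma_i\}_{i=1}^n$ be i.i.d. Rademacher random variables, i.e., $\sigma_i \in \{\pm 1\}$ and $P(\sigma_i = 1) = P(\sigma_i = -1) = \frac12$.

Proposition 10 (Steinwart, 2008, Theorem 7.16) Let $\mathcal{B}_{\sigma,a,b} \subset \mathcal{H}_m$ be a set such that $\mathcal{B}_{\sigma,a,b} = \{f_m \in \mathcal{H}_m \mid \|f_m\|_{L_2(\Pi)} \le \sigma,\ \|f_m\|_{\mathcal{H}_m} \le a,\ \|f_m\|_\infty \le b\}$. Assume that there exist constants $0 < s < 1$ and $0 < c_s$ such that

$$E_{D_n}[e_i(\mathcal{H}_m \to L_2(D_n))] \le c_s i^{-\frac{1}{2s}}.$$

Then there exists a constant $C'_s$ depending only on $s$ such that

$$E\Big[\sup_{f_m \in \mathcal{B}_{\sigma,a,b}} \Big|\frac{1}{n}\sum_{i=1}^n \sigma_i f_m(x_i)\Big|\Big] \le C'_s\Big(\frac{\sigma^{1-s}(c_s a)^s}{\sqrt{n}} \vee (c_s a)^{\frac{2s}{1+s}}\, b^{\frac{1-s}{1+s}}\, n^{-\frac{1}{1+s}}\Big). \qquad (25)$$

Proposition 11 (Talagrand's Concentration Inequality; Talagrand (1996); Bousquet (2002)) Let $\mathcal{G}$ be a function class on $\mathcal{X}$ that is separable with respect to the $\infty$-norm, and $\{x_i\}_{i=1}^n$ be i.i.d. random variables with values in $\mathcal{X}$. Furthermore, let $B \ge 0$ and $U \ge 0$ be $B := \sup_{g \in \mathcal{G}} E[(g - E[g])^2]$ and $U := \sup_{g \in \mathcal{G}} \|g\|_\infty$; then there exists a universal constant $K$ such that, for $Z := \sup_{g \in \mathcal{G}} \big|\frac{1}{n}\sum_{i=1}^n g(x_i) - E[g]\big|$, we have

$$P\Big(Z \ge K\Big[E[Z] + \sqrt{\frac{Bt}{n}} + \frac{Ut}{n}\Big]\Big) \le e^{-t}.$$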

Appendix C. Proof of Theorem [T] 

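Let $r_m > 0$ $(m = 1,\dots,M)$ be arbitrary positive reals. Given $\{r_m\}_{m=1}^M$, we determine $U^{(m)}_{n,s_m}(f_m)$ as follows:

$$U^{(m)}_{n,s_m}(f_m) := 3\Bigg(\frac{r_m^{-s_m}}{\sqrt{n}} \vee \frac{r_m^{-\frac{s_m(3-s_m)}{1+s_m}}}{n^{\frac{1}{1+s_m}}}\Bigg)\big(\|f_m\|_{L_2(\Pi)} + s_m r_m\|f_m\|_{\mathcal{H}_m}\big) + \sqrt{\frac{\log(M)}{n}}\,\|f_m\|_{L_2(\Pi)}.$$

It is easy to see that $U^{(m)}_{n,s_m}(f_m)$ is an upper bound of the quantity

$$\frac{\|f_m\|^{1-s_m}_{L_2(\Pi)}\|f_m\|^{s_m}_{\mathcal{H}_m}}{\sqrt{n}} \vee \frac{\|f_m\|^{\frac{(1-s_m)^2}{1+s_m}}_{L_2(\Pi)}\|f_m\|^{\frac{s_m(3-s_m)}{1+s_m}}_{\mathcal{H}_m}}{n^{\frac{1}{1+s_m}}}$$

(this corresponds to the RHS of Eq. (25)) because

$$\begin{aligned}
\frac{\|f_m\|^{1-s_m}_{L_2(\Pi)}\|f_m\|^{s_m}_{\mathcal{H}_m}}{\sqrt{n}} &= \frac{r_m^{-s_m}}{\sqrt{n}}\,\|f_m\|^{1-s_m}_{L_2(\Pi)}\big(r_m\|f_m\|_{\mathcal{H}_m}\big)^{s_m} \\
&\overset{\text{(Young)}}{\le} \frac{r_m^{-s_m}}{\sqrt{n}}\big((1-s_m)\|f_m\|_{L_2(\Pi)} + s_m r_m\|f_m\|_{\mathcal{H}_m}\big) \\
&\le \frac{r_m^{-s_m}}{\sqrt{n}}\big(\|f_m\|_{L_2(\Pi)} + s_m r_m\|f_m\|_{\mathcal{H}_m}\big), \qquad (26)
\end{aligned}$$

where we used Young's inequality $a^{1-s_m}b^{s_m} \le (1 - s_m)a + s_m b$ in the second line, and similarly we obtain

$$\begin{aligned}
\frac{\|f_m\|^{\frac{(1-s_m)^2}{1+s_m}}_{L_2(\Pi)}\|f_m\|^{\frac{s_m(3-s_m)}{1+s_m}}_{\mathcal{H}_m}}{n^{\frac{1}{1+s_m}}} &\le \frac{r_m^{-\frac{s_m(3-s_m)}{1+s_m}}}{n^{\frac{1}{1+s_m}}}\Big(\|f_m\|_{L_2(\Pi)} + \frac{s_m(3-s_m)}{1+s_m}\, r_m\|f_m\|_{\mathcal{H}_m}\Big) \\
&\le 3\,\frac{r_m^{-\frac{s_m(3-s_m)}{1+s_m}}}{n^{\frac{1}{1+s_m}}}\big(\|f_m\|_{L_2(\Pi)} + s_m r_m\|f_m\|_{\mathcal{H}_m}\big),
\end{aligned}$$

where we used $\frac{s_m(3-s_m)}{1+s_m} \le 3s_m$ in the last inequality.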
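
The two Young-type bounds can be checked numerically. In the sketch below, `l2` and `hk` stand for $\|f_m\|_{L_2(\Pi)}$ and $\|f_m\|_{\mathcal{H}_m}$, and the second bound is taken with exponent $\theta = \frac{s_m(3-s_m)}{1+s_m}$ on the RKHS norm (so that $1 - \theta = \frac{(1-s_m)^2}{1+s_m}$). The function names are hypothetical and the random test values are arbitrary.

```python
import math
import random


def first_gap(l2, hk, r, s, n):
    """RHS minus LHS of Eq. (26): r^{-s}/sqrt(n) (l2 + s r hk) - l2^{1-s} hk^s / sqrt(n)."""
    lhs = l2 ** (1 - s) * hk ** s / math.sqrt(n)
    rhs = r ** (-s) / math.sqrt(n) * (l2 + s * r * hk)
    return rhs - lhs


def second_gap(l2, hk, r, s, n):
    """RHS minus LHS of the second bound, with theta = s(3-s)/(1+s)."""
    theta = s * (3 - s) / (1 + s)
    scale = n ** (-1 / (1 + s))
    lhs = l2 ** (1 - theta) * hk ** theta * scale
    rhs = 3 * r ** (-theta) * scale * (l2 + s * r * hk)
    return rhs - lhs


if __name__ == "__main__":
    rng = random.Random(0)
    worst = float("inf")
    for _ in range(10000):
        args = (rng.uniform(0.01, 10), rng.uniform(0.01, 10), rng.uniform(0.01, 10),
                rng.uniform(0.01, 0.99), rng.randint(10, 1000))
        worst = min(worst, first_gap(*args), second_gap(*args))
    print("smallest gap found:", worst)  # nonnegative if both bounds hold
```

Both gaps are nonnegative for any positive `l2`, `hk`, `r` and any $0 < s < 1$, since $\theta \le 1$ and $\theta \le 3s$.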



Now we define 



:= max 



[KL 



2C* + 1 + Ci 



2CiC* + Ci + 



where $C^*$ is a constant defined later in Lemma 16, $C_1$ is the one introduced in Assumption 5, $K$ is the universal constant appearing in Talagrand's concentration inequality (Proposition 11) and $L$ is the constant introduced in the assumptions to bound the magnitude of the noise. Recall the definition of $\eta(t)$:

$$\eta(t) := \eta_n(t) = \max(1, \sqrt{t}, t/\sqrt{n}).$$
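We define events $\mathcal{E}_1(t)$ and $\mathcal{E}_2(t')$ as

$$\mathcal{E}_1(t) = \Big\{ \Big|\frac{1}{n}\sum_{i=1}^n \epsilon_i f_m(x_i)\Big| \le \eta(t)\,\phi\, U^{(m)}_{n,s_m}(f_m),\ \forall f_m \in \mathcal{H}_m\ (m = 1,\dots,M) \Big\}, \qquad (27)$$

$$\mathcal{E}_2(t') = \Big\{ \Big|\Big\|\sum_{m=1}^M f_m\Big\|_n^2 - \Big\|\sum_{m=1}^M f_m\Big\|^2_{L_2(\Pi)}\Big| \le \sqrt{n}\,\phi\,\eta(t')\Big(\sum_{m=1}^M U^{(m)}_{n,s_m}(f_m)\Big)^2,\ \forall f_m \in \mathcal{H}_m\ (m = 1,\dots,M) \Big\}. \qquad (28)$$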



Using Lemmas 17 and 18 that will be shown in Appendix D, we see that the events $\mathcal{E}_1(t)$ and $\mathcal{E}_2(t')$ occur with probability no less than $1 - \exp(-t)$ and $1 - \exp(-t')$ respectively, as in the following lemma.

Lemma 12 Under the Basic Assumption, the Spectral Assumption and the Embedded Assumption, the probabilities of $\mathcal{E}_1(t)$ and $\mathcal{E}_2(t')$ are bounded as

$$P(\mathcal{E}_1(t)) \ge 1 - \exp(-t), \qquad P(\mathcal{E}_2(t')) \ge 1 - \exp(-t').$$

Proof Lemma 18 immediately gives $P(\mathcal{E}_1(t)) \ge 1 - \exp(-t)$ by noticing that the constant in the statement of Lemma 18 is bounded by $\phi$. Moreover, since $\phi'$ in the statement of Lemma 17 satisfies $\phi' \le \phi$, we have $P(\mathcal{E}_2(t')) \ge 1 - \exp(-t')$ by Lemma 17. ■



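Recall the definition of $\alpha_1, \alpha_2, \beta_1, \beta_2$:

$$\alpha_1 = 3\Big(\sum_{m=1}^M \frac{r_m^{-2s_m}}{n}\Big)^{\frac12}, \qquad \alpha_2 = 3\,\Big\|\Big(\frac{s_m r_m^{1-s_m}}{\sqrt{n}}\Big)_{m=1}^M\Big\|_{\psi^*},$$

$$\beta_1 = 3\Big(\sum_{m=1}^M \frac{r_m^{-\frac{2s_m(3-s_m)}{1+s_m}}}{n^{\frac{2}{1+s_m}}}\Big)^{\frac12}, \qquad \beta_2 = 3\,\Big\|\Big(\frac{s_m r_m^{\frac{(1-s_m)^2}{1+s_m}}}{n^{\frac{1}{1+s_m}}}\Big)_{m=1}^M\Big\|_{\psi^*}, \qquad (29)$$

for given reals $\{r_m\}_{m=1}^M$. The following theorem immediately gives Theorem 1.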



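Theorem 13 Suppose Assumptions 1-4 are satisfied. Let $\{r_m\}_{m=1}^M$ be arbitrary positive reals that can depend on $n$, and assume $\lambda_1^{(n)} \ge \big(\frac{\alpha_2}{\alpha_1}\big)^2 + \big(\frac{\beta_2}{\beta_1}\big)^2$. Then for all $n$ and $t'$ that satisfy $\frac{\log(M)}{\sqrt{n}} \le 1$ and $\frac{4\phi\sqrt{n}}{\kappa_M}\max\{\alpha_1^2, \beta_1^2, \frac{M\log(M)}{n}\}\eta(t') \le \frac{1}{12}$, and for all $t \ge 1$, we have

$$\|\hat f - f^*\|^2_{L_2(\Pi)} \le \frac{24\eta(t)^2\phi^2}{\kappa_M}\Big(\alpha_1^2 + \beta_1^2 + \frac{M\log(M)}{n}\Big) + 4\lambda_1^{(n)}\|f^*\|^2_\psi,$$

with probability $1 - \exp(-t) - \exp(-t')$.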



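Proof [Proof of Theorem 13] By the assumption of the theorem, we can assume Lemma 12 holds, that is, the event $\mathcal{E}_1(t) \cap \mathcal{E}_2(t')$ occurs with probability $1 - \exp(-t) - \exp(-t')$. Below we argue on the event $\mathcal{E}_1(t) \cap \mathcal{E}_2(t')$. Since $y_i = f^*(x_i) + \epsilon_i$, we have

$$\|\hat f - f^*\|^2_{L_2(\Pi)} + \lambda_1^{(n)}\|\hat f\|^2_\psi \le \big(\|\hat f - f^*\|^2_{L_2(\Pi)} - \|\hat f - f^*\|^2_n\big) + \frac{2}{n}\sum_{i=1}^n\sum_{m=1}^M \epsilon_i\big(\hat f_m(x_i) - f^*_m(x_i)\big) + \lambda_1^{(n)}\|f^*\|^2_\psi.$$

Here on the event $\mathcal{E}_2(t')$, the above inequality gives

$$\|\hat f - f^*\|^2_{L_2(\Pi)} + \lambda_1^{(n)}\|\hat f\|^2_\psi \le \sqrt{n}\,\phi\,\eta(t')\Big(\sum_{m=1}^M U^{(m)}_{n,s_m}(\hat f_m - f^*_m)\Big)^2 + \frac{2}{n}\sum_{i=1}^n\sum_{m=1}^M \epsilon_i\big(\hat f_m(x_i) - f^*_m(x_i)\big) + \lambda_1^{(n)}\|f^*\|^2_\psi. \qquad (30)$$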

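Before we prove the statements, we show an upper bound of $\sum_{m=1}^M U^{(m)}_{n,s_m}(f_m)$ required in the proof. By definition, we have

$$\begin{aligned}
U^{(m)}_{n,s_m}(f_m) &= 3\Bigg(\frac{r_m^{-s_m}}{\sqrt{n}} \vee \frac{r_m^{-\frac{s_m(3-s_m)}{1+s_m}}}{n^{\frac{1}{1+s_m}}}\Bigg)\big(\|f_m\|_{L_2(\Pi)} + s_m r_m\|f_m\|_{\mathcal{H}_m}\big) + \sqrt{\frac{\log(M)}{n}}\,\|f_m\|_{L_2(\Pi)} \\
&\le 3\,\frac{r_m^{-s_m}}{\sqrt{n}}\big(\|f_m\|_{L_2(\Pi)} + s_m r_m\|f_m\|_{\mathcal{H}_m}\big) + 3\,\frac{r_m^{-\frac{s_m(3-s_m)}{1+s_m}}}{n^{\frac{1}{1+s_m}}}\big(\|f_m\|_{L_2(\Pi)} + s_m r_m\|f_m\|_{\mathcal{H}_m}\big) \qquad (31) \\
&\quad + \sqrt{\frac{\log(M)}{n}}\,\|f_m\|_{L_2(\Pi)}. \qquad (32)
\end{aligned}$$

Now the sum of the first term is bounded as

$$\begin{aligned}
\sum_{m=1}^M 3\,\frac{r_m^{-s_m}}{\sqrt{n}}\big(\|f_m\|_{L_2(\Pi)} + s_m r_m\|f_m\|_{\mathcal{H}_m}\big) &= 3\sum_{m=1}^M \frac{r_m^{-s_m}}{\sqrt{n}}\|f_m\|_{L_2(\Pi)} + 3\sum_{m=1}^M \frac{s_m r_m^{1-s_m}}{\sqrt{n}}\|f_m\|_{\mathcal{H}_m} \\
&\le 3\Big(\sum_{m=1}^M \frac{r_m^{-2s_m}}{n}\Big)^{\frac12}\Big(\sum_{m=1}^M \|f_m\|^2_{L_2(\Pi)}\Big)^{\frac12} + 3\,\Big\|\Big(\frac{s_m r_m^{1-s_m}}{\sqrt{n}}\Big)_{m=1}^M\Big\|_{\psi^*}\|f\|_\psi,
\end{aligned}$$

where we used the Cauchy-Schwarz inequality and the duality of the norm in the last inequality. The sum of the second term of the RHS of Eq. (32) is bounded as

$$\begin{aligned}
\sum_{m=1}^M 3\,\frac{r_m^{-\frac{s_m(3-s_m)}{1+s_m}}}{n^{\frac{1}{1+s_m}}}\big(\|f_m\|_{L_2(\Pi)} + s_m r_m\|f_m\|_{\mathcal{H}_m}\big) &= 3\sum_{m=1}^M \frac{r_m^{-\frac{s_m(3-s_m)}{1+s_m}}}{n^{\frac{1}{1+s_m}}}\|f_m\|_{L_2(\Pi)} + 3\sum_{m=1}^M \frac{s_m r_m^{\frac{(1-s_m)^2}{1+s_m}}}{n^{\frac{1}{1+s_m}}}\|f_m\|_{\mathcal{H}_m} \\
&\le 3\Big(\sum_{m=1}^M \frac{r_m^{-\frac{2s_m(3-s_m)}{1+s_m}}}{n^{\frac{2}{1+s_m}}}\Big)^{\frac12}\Big(\sum_{m=1}^M \|f_m\|^2_{L_2(\Pi)}\Big)^{\frac12} + 3\,\Big\|\Big(\frac{s_m r_m^{\frac{(1-s_m)^2}{1+s_m}}}{n^{\frac{1}{1+s_m}}}\Big)_{m=1}^M\Big\|_{\psi^*}\|f\|_\psi,
\end{aligned}$$

where we used the Cauchy-Schwarz inequality and the duality of the norm in the last inequality. Finally we have the following bound of the third term of the RHS of Eq. (32):

$$\sum_{m=1}^M \sqrt{\frac{\log(M)}{n}}\,\|f_m\|_{L_2(\Pi)} \le \sqrt{\frac{M\log(M)}{n}}\Big(\sum_{m=1}^M \|f_m\|^2_{L_2(\Pi)}\Big)^{\frac12}.$$

Combine these inequalities and the relation $\sum_{m=1}^M \|f_m\|^2_{L_2(\Pi)} \le \frac{1}{\kappa_M}\|f\|^2_{L_2(\Pi)}$ (the incoherence assumption, Assumption 4) to obtain

$$\begin{aligned}
\sum_{m=1}^M U^{(m)}_{n,s_m}(f_m) \le{}& 3\Big(\sum_{m=1}^M \frac{r_m^{-2s_m}}{n}\Big)^{\frac12}\frac{\|f\|_{L_2(\Pi)}}{\sqrt{\kappa_M}} + 3\,\Big\|\Big(\frac{s_m r_m^{1-s_m}}{\sqrt{n}}\Big)_{m=1}^M\Big\|_{\psi^*}\|f\|_\psi \\
&+ 3\Big(\sum_{m=1}^M \frac{r_m^{-\frac{2s_m(3-s_m)}{1+s_m}}}{n^{\frac{2}{1+s_m}}}\Big)^{\frac12}\frac{\|f\|_{L_2(\Pi)}}{\sqrt{\kappa_M}} + 3\,\Big\|\Big(\frac{s_m r_m^{\frac{(1-s_m)^2}{1+s_m}}}{n^{\frac{1}{1+s_m}}}\Big)_{m=1}^M\Big\|_{\psi^*}\|f\|_\psi \\
&+ \sqrt{\frac{M\log(M)}{n}}\,\frac{\|f\|_{L_2(\Pi)}}{\sqrt{\kappa_M}}. \qquad (33)
\end{aligned}$$

Then by the definition (29) of $\alpha_1, \alpha_2, \beta_1, \beta_2$, we have

$$\sum_{m=1}^M U^{(m)}_{n,s_m}(f_m) \le \alpha_1\frac{\|f\|_{L_2(\Pi)}}{\sqrt{\kappa_M}} + \alpha_2\|f\|_\psi + \beta_1\frac{\|f\|_{L_2(\Pi)}}{\sqrt{\kappa_M}} + \beta_2\|f\|_\psi + \sqrt{\frac{M\log(M)}{n}}\,\frac{\|f\|_{L_2(\Pi)}}{\sqrt{\kappa_M}}. \qquad (34)$$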

By Eq. (34), the first term on the RHS of Eq. (30) can be upper bounded as



M 



\m=l 



f* 1 1 2 



11/ /*lll 2 (n) 2|l i ,* l|2 , ,, 2 II/ ^ U£a(n) 



km 



+ «2ii/-ri^ + /3 



km 



+ 



ft 2 ii/-riii + Mlog(M)ll/ " rilL ™ 



n 



KM 



< 



afi7(0( ii/-riii 2 (n) + (^) ii/ -rift 



f*l|2 



+ 



+ 



KM 



02 V 



f*l|2 



Q!l J 



ii/-riii 2{ n) + (f) ii/- mj 



KM 

40^/nMlog(M) , 



KM n 



rtOII/-/ 



*l|2 



ii 2 (n)- 



(33) 



(34) 



By assumption, we have 4 ^ max{a|, (3f, M1 °g( A:f ) }rf(i?) < j^. Hence the RHS of the above 
inequality is bounded by 



M 



< 



ii/-rni 2( n) + 



ii/-ni$ 



(35)

Step 2. On the event $\mathcal{E}_1(t)$, we have



M 



M 
m=l 



i=l m=l m=l 

<2r,(t)<f> ai I^ZpMS +a2 ||/_ r[ | +j g 2 ||/_ r || 

(vEq.©) 



u j Mlog(M) ||/-/*||L 2 (n) 



n 



< 2 



^(ll/-HMa, + *l/ -/•■*)+ * 

a /km V «1 / 



^^/-/•iiMn, + |ii/-riu 



< 



^^("/-/•'iMo,^ii/-/i; 



+ 



KM
67K*)VMlog(M) , 1 „* ,* ||2



KM 
2^2 



< 



KM 



1 + H 



+ 



KM 



ii/-riil 2 (n) + (^) 2 n/-re 

-/*lll 2 (n) + f^J II/-/ 



< 



, 6T/(t)V 2 Mlog(M) 1 ; 2 
+ ^M n + 12 ll/_/ 

v ^™)+>~-rii! 2(n) + 

km V n / 4 I - 



12fy(f)V / , 2 
«1 + Pi + 

KM V 



n/-rnj 

(36) ' 



5iep 5. 

Substituting the inequalities (|3"5|) and 



to Eq. ([3U|) . we obtain 



ll/-/*ll! 2 (n) + Ani/ll^ 



to I 



< 



127?(t) 



2 J.2 



KM 

+ Ai n) iiri^. 



2 2 Mlog(Af) \ 1 
"1 + P1 + 1 + 2 



f* II 2 



lL 2 (n) 



+ 



"2\ / #2 



7 1 1 v 
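(37)

Now, by the triangle inequality, the term $\|\hat f - f^*\|^2_\psi$ can be bounded as

$$\|\hat f - f^*\|^2_\psi \le \big(\|\hat f\|_\psi + \|f^*\|_\psi\big)^2 \le 2\big(\|\hat f\|^2_\psi + \|f^*\|^2_\psi\big).$$

Thus, when $\lambda_1^{(n)} \ge \big(\frac{\alpha_2}{\alpha_1}\big)^2 + \big(\frac{\beta_2}{\beta_1}\big)^2$, Eq. (37) yields

$$\frac12\|\hat f - f^*\|^2_{L_2(\Pi)} \le \frac{12\eta(t)^2\phi^2}{\kappa_M}\Big(\alpha_1^2 + \beta_1^2 + \frac{M\log(M)}{n}\Big) + 2\lambda_1^{(n)}\|f^*\|^2_\psi.$$

Therefore, multiplying both sides by 2, we have

$$\|\hat f - f^*\|^2_{L_2(\Pi)} \le \frac{24\eta(t)^2\phi^2}{\kappa_M}\Big(\alpha_1^2 + \beta_1^2 + \frac{M\log(M)}{n}\Big) + 4\lambda_1^{(n)}\|f^*\|^2_\psi.$$

This gives the assertion. ■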



Appendix D. Bounding the Probabilities of $\mathcal{E}_1(t)$ and $\mathcal{E}_2(t')$

Here we derive bounds of the probabilities of the events $\mathcal{E}_1(t)$ and $\mathcal{E}_2(t')$ (see Eq. (27) and Eq. (28) for their definitions). The goal of this section is to derive Lemmas 17 and 18. Using Propositions 11 and 10, we obtain the following ratio type uniform bound.

Lemma 14 Under the Spectral Assumption and the Embedded Assumption, there exists a constant $C_{s_m}$ depending only on $s_m$, $c$ and $C_1$ such that

$$E\Bigg[\sup_{f_m \in \mathcal{H}_m:\ \|f_m\|_{\mathcal{H}_m} = 1} \frac{\big|\frac{1}{n}\sum_{i=1}^n \sigma_i f_m(x_i)\big|}{U^{(m)}_{n,s_m}(f_m)}\Bigg] \le C_{s_m}.$$

Proof [Proof of Lemma 14] Let $\mathcal{H}_m(\delta) := \{f_m \in \mathcal{H}_m \mid \|f_m\|_{\mathcal{H}_m} = 1,\ \|f_m\|_{L_2(\Pi)} \le \delta\}$ and $z = 2^{1/s_m} > 1$. Define $\tau := s_m r_m$. Then by combining Propositions 9 and 10 with Assumption 5, we have


E 



<E 



sup 

fm&Hrn-\\frn\\n r , 



n J2i=l °~ifm(Xj) 



sup 



I n Si=l °~ifm{%i)\ 



u n,s m \Ji 



k=l 



sup 

f m €'H m (TZ k )\'H m (TZ k - 1 ) 



I n Si=l °~ifm(xi)\ 



m, (1-%) 



V 



v 7 " 



1+% 



z '(l-%) r l-%j s ™ 



1 



) 2 2s m 
1+% 



5X 

fe=i 



V 



. 1+sn 



■TZ 



k-1 



l + sn 

2 

1 



fc-i
s



where we used s m Sm < 3 for < s m in the last line. Thus by setting, 



%[3-%) 



C Sm = 9C' S c*» V a 1+am cl+ sm 1 + t^-^t V ^ tr s « , we obtain the as- 

V /V i-z i+% ' 

sertion. 



This lemma immediately gives the following corollary. 

Corollary 15 Under the Spectral Assumption (Assumption^ and the Embedded Assump- 
tion (Assumption \B\), there exists a constant C Sm depending only on s m i c and C\ such 
that 



E 



sup 



y~)j—l 0~ifm{%i 



1 1 V^™ 

I n 



Tj(m) (f 
u n,s m \Jn 



<c„ 



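Proof By dividing the denominator and the numerator by the RKHS norm $\|f_m\|_{\mathcal{H}_m}$, we have

$$\begin{aligned}
\mathrm{E}\left[\sup_{f_m\in\mathcal{H}_m}\frac{\left|\frac{1}{n}\sum_{i=1}^n\sigma_i f_m(x_i)\right|}{U_{n,s_m}^{(m)}(f_m)}\right]
&= \mathrm{E}\left[\sup_{f_m\in\mathcal{H}_m}\frac{\left|\frac{1}{n}\sum_{i=1}^n\sigma_i f_m(x_i)\right| / \|f_m\|_{\mathcal{H}_m}}{U_{n,s_m}^{(m)}(f_m) / \|f_m\|_{\mathcal{H}_m}}\right] \\
&= \mathrm{E}\left[\sup_{f_m\in\mathcal{H}_m}\frac{\left|\frac{1}{n}\sum_{i=1}^n\sigma_i f_m(x_i)/\|f_m\|_{\mathcal{H}_m}\right|}{U_{n,s_m}^{(m)}\left(f_m/\|f_m\|_{\mathcal{H}_m}\right)}\right] \\
&= \mathrm{E}\left[\sup_{f_m\in\mathcal{H}_m:\ \|f_m\|_{\mathcal{H}_m}=1}\frac{\left|\frac{1}{n}\sum_{i=1}^n\sigma_i f_m(x_i)\right|}{U_{n,s_m}^{(m)}(f_m)}\right] \\
&\le C_{s_m} \qquad (\because \text{Lemma 14}).
\end{aligned}$$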



Lemma 16 If $\frac{\log(M)}{\sqrt{n}} \le 1$, then under the Spectral Assumption and the Embedded Assumption there exists a constant $\tilde{C}$ depending only on $\{s_m\}_{m=1}^M$, $c$, $C_1$ such that

$$\mathrm{E}\left[\max_m\sup_{f_m\in\mathcal{H}_m}\frac{\left|\frac{1}{n}\sum_{i=1}^n\sigma_i f_m(x_i)\right|}{U_{n,s_m}^{(m)}(f_m)}\right] \le \tilde{C}.$$

Proof [Proof of Lemma 16] First notice that the $L_2(\Pi)$-norm and the $\infty$-norm of $f_m / U_{n,s_m}^{(m)}(f_m)$
can be evaluated by




[|/» 



lL 2 (n) 



rrM /, 
L 2 (n) u n,s m KJn 



< 



ll/m||z, 2 (n) 



log(M) 



< 



II/; 



/m||i 2 (n) 

1 1 Sm 1 1 £ II Sm 



n 



log(M)' 



(38) 



„ ;|! . x ^ Clll/mllij^ll/mll^ C 



< 



< -jVn < CiVTi, (39) 



where the second line is shown by using the relation (26). Let $\bar{C} := \max_m C_{s_m}$, where $C_{s_m}$
is the constant appearing in Lemma 14. Then Talagrand's inequality and Corollary 15 imply



P max sup 



n Si=l a ifm(Xi 



fm&im Un,s m {fm) 



> K 



M 

s5> 



In Z]j=l gjjm(-P*)l . 
SUp ; — ; > A 



m=l Un,s m {fm) 



<^p( sup ^g^M> K 
m =l \fmeH m Un,s m (fm) 



a + 



+ 



log(M) ^ 



+ 



log(M) 



+ 



log(M) y/n 



<Me~K 

By setting t <— t + log(M), we obtain 

I n Si=l a ifm( x i)\ 



P max sup 



> K 



c* + 



't + log(M) | Ci(t + log(M)) 



log(M) V" 
for all t > 0. Consequently the expectation of the max-sup term can be bounded as 

In Si=l a ifm{Xi) 



< e 



-t 



E 



max sup ( 

m f m &H m Un,s m (fm) 

d log(M) 



/>oo 




+ / 




JO 





i+l + log(M) Ci(t + l + log(Af)) 



<2K 



c , + ^ + ,/zzz: + c.(^m^» 

41og(M) 



log(M) 



+ 



71 



e _t dt 



where we used $\sqrt{t + 1 + \log(M)} \le \sqrt{t} + \sqrt{1 + \log(M)}$ and $\int_0^\infty \sqrt{t}\,e^{-t}\,dt = \frac{\sqrt{\pi}}{2} \le 1$,
and C* = 2if [C* + ^2 + ^ + 3d]. ' ■ 
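
The two elementary facts used in the last step can be checked numerically. The following is a minimal sketch, not part of the original argument; the helper names and the quadrature parameters are our own illustrative choices.

```python
import math

def integral_sqrt_t_exp(upper=50.0, steps=200_000):
    """Midpoint-rule approximation of the integral of sqrt(t) * exp(-t) over [0, upper].

    The tail beyond `upper` is negligible for upper = 50.
    """
    h = upper / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        total += math.sqrt(t) * math.exp(-t)
    return h * total

def sqrt_is_subadditive(a, b):
    """Check sqrt(a + b) <= sqrt(a) + sqrt(b) for nonnegative a and b."""
    return math.sqrt(a + b) <= math.sqrt(a) + math.sqrt(b) + 1e-12

exact_value = math.sqrt(math.pi) / 2.0   # equals Gamma(3/2)
numeric_value = integral_sqrt_t_exp()
```

Here `exact_value` is the closed form $\Gamma(3/2) = \sqrt{\pi}/2$, which is below 1, and `numeric_value` approximates the same integral by quadrature.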



Lemma 17 Suppose the Basic Assumption, the Spectral Assumption and the Embedded Assumption hold. Define $\phi = KL\left[2\tilde{C} + 1 + C_1\right]$. If $\frac{\log(M)}{\sqrt{n}} \le 1$, then the following holds:

$$P\left(\max_m\sup_{f_m\in\mathcal{H}_m}\frac{\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i f_m(x_i)\right|}{U_{n,s_m}^{(m)}(f_m)} \ge \phi\,\eta(t)\right) \le e^{-t}.$$

Proof [Proof of Lemma 17] By the contraction inequality (Ledoux and Talagrand, 1991,
Theorem 4.12) and Lemma 16, we have



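$$\mathrm{E}\left[\max_m\sup_{f_m\in\mathcal{H}_m}\frac{\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i f_m(x_i)\right|}{U_{n,s_m}^{(m)}(f_m)}\right]
\le 2\,\mathrm{E}\left[\max_m\sup_{f_m\in\mathcal{H}_m}\frac{\left|\frac{1}{n}\sum_{i=1}^n\sigma_i\epsilon_i f_m(x_i)\right|}{U_{n,s_m}^{(m)}(f_m)}\right]
\le 2L\tilde{C},$$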



where we used $|\epsilon_i| \le L$ (Basic Assumption). Using this together with Eq. (38) and Eq. (39), Talagrand's
inequality gives



n| I n Yli=l e ifm{%ij , ^ , , , 

1^ max sup — — > KL 



fm&-Lm Un,s m (fm) 



2(% + y/t + 



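Thus we have

$$P\left(\max_m\sup_{f_m\in\mathcal{H}_m}\frac{\left|\frac{1}{n}\sum_{i=1}^n\epsilon_i f_m(x_i)\right|}{U_{n,s_m}^{(m)}(f_m)} \ge KL\left[2\tilde{C} + 1 + C_1\right]\max\left(1, \sqrt{t}, t/\sqrt{n}\right)\right) \le e^{-t}.$$

Therefore, by the definition of $\phi$ and $\eta(t)$, we obtain the assertion.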



Lemma 18 Suppose the Basic Assumption, the Spectral Assumption and the Embedded Assumption hold. Let $\phi' = K\left[2C_1\tilde{C} + C_1 + C_1^2\right]$. Then, if $\frac{\log(M)}{\sqrt{n}} \le 1$, we have for all $t \ge 0$

$$\left|\Big\|\sum_{m=1}^M f_m\Big\|_n^2 - \Big\|\sum_{m=1}^M f_m\Big\|_{L_2(\Pi)}^2\right| \le \phi'\sqrt{n}\left(\sum_{m=1}^M U_{n,s_m}^{(m)}(f_m)\right)^2\eta(t),$$

for all $f_m \in \mathcal{H}_m$ $(m = 1, \dots, M)$ with probability $1 - \exp(-t)$.
Proof [Proof of Lemma 18]



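$$\begin{aligned}
&\mathrm{E}\left[\sup_{f_m\in\mathcal{H}_m}\frac{\left|\|\sum_{m=1}^M f_m\|_n^2 - \|\sum_{m=1}^M f_m\|_{L_2(\Pi)}^2\right|}{\left(\sum_{m=1}^M U_{n,s_m}^{(m)}(f_m)\right)^2}\right] \\
&\quad\le 2\,\mathrm{E}\left[\sup_{f_m\in\mathcal{H}_m}\frac{\left|\frac{1}{n}\sum_{i=1}^n\sigma_i\left(\sum_{m=1}^M f_m(x_i)\right)^2\right|}{\left(\sum_{m=1}^M U_{n,s_m}^{(m)}(f_m)\right)^2}\right] \\
&\quad\le \sup_{f_m\in\mathcal{H}_m}\frac{\|\sum_{m=1}^M f_m\|_\infty}{\sum_{m=1}^M U_{n,s_m}^{(m)}(f_m)} \times 2\,\mathrm{E}\left[\sup_{f_m\in\mathcal{H}_m}\frac{\left|\frac{1}{n}\sum_{i=1}^n\sigma_i\left(\sum_{m=1}^M f_m(x_i)\right)\right|}{\sum_{m=1}^M U_{n,s_m}^{(m)}(f_m)}\right],
\end{aligned} \tag{40}$$

where we used the contraction inequality in the last line (Ledoux and Talagrand, 1991,
Theorem 4.12). Thus, using Eq. (39), the RHS of the inequality (40) can be bounded as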



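$$2C_1\sqrt{n}\,\mathrm{E}\left[\sup_{f_m\in\mathcal{H}_m}\frac{\left|\frac{1}{n}\sum_{i=1}^n\sigma_i\left(\sum_{m=1}^M f_m(x_i)\right)\right|}{\sum_{m=1}^M U_{n,s_m}^{(m)}(f_m)}\right]
\le 2C_1\sqrt{n}\,\mathrm{E}\left[\sup_{f_m\in\mathcal{H}_m}\max_m\frac{\left|\frac{1}{n}\sum_{i=1}^n\sigma_i f_m(x_i)\right|}{U_{n,s_m}^{(m)}(f_m)}\right],$$

where we used the relation

$$\frac{\sum_m a_m}{\sum_m b_m} \le \max_m\frac{a_m}{b_m} \tag{41}$$

for all $a_m \ge 0$ and $b_m \ge 0$ with the convention $\frac{0}{0} = 0$. By Lemma 16, the right hand side
is upper bounded by $2C_1\sqrt{n}\,\tilde{C}$. Here we again apply Talagrand's concentration inequality;
then we have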



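$$P\left(\sup_{f_m\in\mathcal{H}_m}\frac{\left|\|\sum_{m=1}^M f_m\|_n^2 - \|\sum_{m=1}^M f_m\|_{L_2(\Pi)}^2\right|}{\left(\sum_{m=1}^M U_{n,s_m}^{(m)}(f_m)\right)^2} \ge K\left[2C_1\tilde{C}\sqrt{n} + \sqrt{tn}\,C_1 + C_1^2 t\right]\right) \le e^{-t},$$

where we substituted the following upper bounds of $B$ and $U$.

$$\begin{aligned}
B &\le \sup_{f_m\in\mathcal{H}_m}\mathrm{E}\left[\left(\frac{\left(\sum_{m=1}^M f_m\right)^2}{\left(\sum_{m=1}^M U_{n,s_m}^{(m)}(f_m)\right)^2}\right)^2\right]
\le \sup_{f_m\in\mathcal{H}_m}\frac{\left(\sum_{m=1}^M\|f_m\|_{L_2(\Pi)}\right)^2\left(\sum_{m=1}^M\|f_m\|_\infty\right)^2}{\left(\sum_{m=1}^M U_{n,s_m}^{(m)}(f_m)\right)^4} \\
&\overset{(39)}{\le} C_1^2 n\sup_{f_m\in\mathcal{H}_m}\left(\frac{\sum_{m=1}^M\|f_m\|_{L_2(\Pi)}}{\sum_{m=1}^M U_{n,s_m}^{(m)}(f_m)}\right)^2
\overset{(38)}{\le} C_1^2 n^2\frac{1}{\log(M)},
\end{aligned}$$

where we used the relation

$$\mathrm{E}\left[\Big(\sum_{m=1}^M f_m\Big)^2\right] = \sum_{m,m'=1}^M\mathrm{E}\left[f_m f_{m'}\right] \le \sum_{m,m'=1}^M\|f_m\|_{L_2(\Pi)}\|f_{m'}\|_{L_2(\Pi)} = \Big(\sum_{m=1}^M\|f_m\|_{L_2(\Pi)}\Big)^2$$

to bound the second moment, and then Eq. (39) and Eq. (38), each combined with Eq. (41), respectively.
Here we again use Eq. (39) with Eq. (41) to obtain

$$U = \sup_{f_m\in\mathcal{H}_m}\left\|\frac{\left(\sum_{m=1}^M f_m\right)^2}{\left(\sum_{m=1}^M U_{n,s_m}^{(m)}(f_m)\right)^2}\right\|_\infty \le C_1^2 n.$$
Therefore the above inequality implies the following inequality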



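$$\sup_{f_m\in\mathcal{H}_m}\frac{\left|\|\sum_{m=1}^M f_m\|_n^2 - \|\sum_{m=1}^M f_m\|_{L_2(\Pi)}^2\right|}{\left(\sum_{m=1}^M U_{n,s_m}^{(m)}(f_m)\right)^2} \le K\left[2C_1\tilde{C} + C_1 + C_1^2\right]\sqrt{n}\max\left(1, \sqrt{t}, t/\sqrt{n}\right),$$

with probability $1 - \exp(-t)$. Recalling that $\phi' = K\left[2C_1\tilde{C} + C_1 + C_1^2\right]$, we obtain the
assertion.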
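
The elementary relation (41), which bounds a ratio of sums by the largest termwise ratio, is used repeatedly in the proofs of this appendix. As an illustration only, the following sketch checks it on random nonnegative inputs; the function names are hypothetical, and the convention $0/0 = 0$ is the one stated with Eq. (41).

```python
import math
import random

def ratio(x, y):
    """x / y for nonnegative x, y, with 0/0 = 0 and x/0 = +inf when x > 0."""
    if y > 0:
        return x / y
    return 0.0 if x == 0 else math.inf

def relation_41_holds(a, b):
    """Check sum(a) / sum(b) <= max_m a_m / b_m up to floating-point slack."""
    lhs = ratio(sum(a), sum(b))
    rhs = max(ratio(x, y) for x, y in zip(a, b))
    return lhs <= rhs * (1.0 + 1e-12)

def random_nonnegative_vector(rng, size):
    """Random nonnegative vector in which some entries are exactly zero."""
    return [rng.choice([0.0, rng.random()]) for _ in range(size)]
```

The inequality follows because $a_m \le \rho\, b_m$ for every $m$ when $\rho = \max_m a_m / b_m$, so summing over $m$ gives $\sum_m a_m \le \rho \sum_m b_m$.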



Appendix E. Proof of Theorem 4 (minimax learning rate)

Let the $\delta$-packing number $Q(\delta, \mathcal{H}, L_2(\Pi))$ of a function class $\mathcal{H}$ be the largest number of
functions $\{f_1, \dots, f_Q\} \subseteq \mathcal{H}$ such that $\|f_i - f_j\|_{L_2(\Pi)} \ge \delta$ for all $i \ne j$.
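
To make the definition concrete, the following sketch builds a $\delta$-separated subset of a finite point set in Euclidean space by a greedy pass. This is only an illustration under our own assumptions: it works with points rather than functions, the function names are hypothetical, and the greedy set gives a lower bound on the packing number of the point set, not its exact value.

```python
import math

def greedy_packing(points, delta):
    """Greedily select points that are pairwise at least `delta` apart."""
    chosen = []
    for p in points:
        if all(math.dist(p, c) >= delta for c in chosen):
            chosen.append(p)
    return chosen

def is_cover(points, centers, delta):
    """Check that every point lies within distance < delta of some center."""
    return all(any(math.dist(p, c) < delta for c in centers) for p in points)

# Example: a regular grid on the unit square.
grid = [(i / 20.0, j / 20.0) for i in range(21) for j in range(21)]
packing = greedy_packing(grid, 0.25)
```

A set produced this way is maximal, so it is also a $\delta$-cover of the point set. This is the standard reason why covering numbers are bounded by packing numbers at the same scale.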

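Proof [Proof of Theorem 4] The proof utilizes the techniques developed by Raskutti et al. (2009, 2010), who applied the information-theoretic technique developed by Yang and Barron (1999) to the MKL setting. To simplify the notation, we write $\mathcal{F} := \mathcal{H}_\psi(R)$, $N(\epsilon, \mathcal{H}) := N(\epsilon, \mathcal{H}, L_2(\Pi))$ and $Q(\epsilon, \mathcal{H}) := Q(\epsilon, \mathcal{H}, L_2(\Pi))$. It can be easily shown that $Q(2\epsilon, \mathcal{F}) \le N(\epsilon, \mathcal{F}) \le Q(\epsilon, \mathcal{F})$. Here, due to Theorem 15 of Steinwart et al. (2009), the assumption of the theorem yields

$$\log N(\epsilon, \mathcal{H}(1)) \sim \epsilon^{-2s}. \tag{42}$$
We utilize the following inequality given by Lemma 3 of Raskutti et al. (2009):

$$\min_{\hat f}\max_{f^*\in\mathcal{F}}\mathrm{E}\|\hat f - f^*\|_{L_2(\Pi)}^2 \ge \frac{\delta_n^2}{4}\left(1 - \frac{\log N(\epsilon_n, \mathcal{F}) + n\epsilon_n^2/2\sigma^2 + \log 2}{\log Q(\delta_n, \mathcal{F})}\right).$$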



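First we show the assertion for the $\ell_\infty$-norm ball: $\mathcal{H}_\psi(R) = \mathcal{H}_{\ell_\infty}(R) := \left\{f = \sum_{m=1}^M f_m \,\middle|\, \max_{1\le m\le M}\|f_m\|_{\mathcal{H}_m} \le R\right\}$. In this situation, there is a constant $C$ that
depends only on $s$ such that

$$\log Q(\delta, \mathcal{F}) \ge CM\log Q(\delta/\sqrt{M}, \mathcal{H}(R)), \qquad \log N(\epsilon, \mathcal{F}) \le M\log N(\epsilon/\sqrt{M}, \mathcal{H}(R))$$

(this is shown in Lemma 5 of Raskutti et al. (2010), but we give the proof in Lemma 19 for
completeness). Using this expression, the minimax learning rate is bounded as

$$\min_{\hat f}\max_{f^*\in\mathcal{H}_{\ell_\infty}(R)}\mathrm{E}\|\hat f - f^*\|_{L_2(\Pi)}^2 \ge \frac{\delta_n^2}{4}\left(1 - \frac{M\log N(\epsilon_n/\sqrt{M}, \mathcal{H}(R)) + n\epsilon_n^2/2\sigma^2 + \log 2}{CM\log Q(\delta_n/\sqrt{M}, \mathcal{H}(R))}\right).$$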
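Here we choose $\epsilon_n$ and $\delta_n$ to satisfy the following relations:

$$\frac{n\epsilon_n^2}{2\sigma^2} \le M\log N(\epsilon_n/\sqrt{M}, \mathcal{H}(R)), \tag{43}$$

$$M\log N(\epsilon_n/\sqrt{M}, \mathcal{H}(R)) \ge \log 2, \tag{44}$$

$$4\log N(\epsilon_n/\sqrt{M}, \mathcal{H}(R)) \le C\log Q(\delta_n/\sqrt{M}, \mathcal{H}(R)). \tag{45}$$

With $\epsilon_n$ and $\delta_n$ that satisfy the above relations (43) to (45), we have

$$\min_{\hat f}\max_{f^*\in\mathcal{H}_{\ell_\infty}(R)}\mathrm{E}\|\hat f - f^*\|_{L_2(\Pi)}^2 \ge \frac{\delta_n^2}{16}. \tag{46}$$

By Eq. (42), the relation (43) can be rewritten as

$$\frac{n\epsilon_n^2}{2\sigma^2} \le CM\left(\frac{R\sqrt{M}}{\epsilon_n}\right)^{2s}.$$
It is sufficient to impose

$$\epsilon_n^2 \le Cn^{-\frac{1}{1+s}}MR^{\frac{2s}{1+s}}, \tag{47}$$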

with a constant C. Since we have assumed that n > glinTp (= f° r II • Hv = II ' lUoo)' 

the conditions (|44[) can be satisfied if the constant C in Eq. (|47|) is taken sufficiently small 
so that we have 

log 2 < log N(s n /VM,H(R)) ~ . (48) 

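The relation (45) can be satisfied by taking $\delta_n = c\epsilon_n$ with an appropriately chosen constant
$c$. Thus Eq. (46) gives

$$\min_{\hat f}\max_{f^*\in\mathcal{H}_{\ell_\infty}(R)}\mathrm{E}\|\hat f - f^*\|_{L_2(\Pi)}^2 \ge Cn^{-\frac{1}{1+s}}MR^{\frac{2s}{1+s}}, \tag{49}$$

with a constant $C$. This gives the assertion for $p = \infty$.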

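Finally we show the assertion for a general isotropic $\psi$-norm $\|\cdot\|_\psi$. To show that, we
prove that $\mathcal{H}_{\ell_\infty}\left(R\|\mathbf{1}\|_{\psi^*}/(cM)\right) \subseteq \mathcal{H}_\psi(R)$. This is true if the vector $\frac{R\|\mathbf{1}\|_{\psi^*}}{cM}\mathbf{1}$ lies in the $\psi$-norm ball of radius $R$, because of the
second condition of the definition (15) of the isotropic property. By the isotropic property, the
$\psi$-norm of $\frac{R\|\mathbf{1}\|_{\psi^*}}{cM}\mathbf{1}$ is bounded as

$$\left\|\frac{R\|\mathbf{1}\|_{\psi^*}}{cM}\mathbf{1}\right\|_\psi = \frac{R}{cM}\|\mathbf{1}\|_{\psi^*}\|\mathbf{1}\|_\psi \le \frac{R}{cM}cM = R.$$

Thus this vector lies in the $\psi$-norm ball of radius $R$, and hence $\mathcal{H}_{\ell_\infty}\left(R\|\mathbf{1}\|_{\psi^*}/(cM)\right) \subseteq \mathcal{H}_\psi(R)$. Therefore we
have

$$\begin{aligned}
\min_{\hat f}\max_{f^*\in\mathcal{H}_\psi(R)}\mathrm{E}\|\hat f - f^*\|_{L_2(\Pi)}^2
&\ge \min_{\hat f}\max_{f^*\in\mathcal{H}_{\ell_\infty}(R\|\mathbf{1}\|_{\psi^*}/(cM))}\mathrm{E}\|\hat f - f^*\|_{L_2(\Pi)}^2 \\
&\ge Cn^{-\frac{1}{1+s}}M\left(\frac{R\|\mathbf{1}\|_{\psi^*}}{cM}\right)^{\frac{2s}{1+s}} \qquad (\because \text{Eq. (49)}).
\end{aligned}$$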

Note that due to the condition n > psVjfp - > Eq. (jUJ) is still valid under the condition 

that $\frac{R\|\mathbf{1}\|_{\psi^*}}{cM}$ is substituted into $R$ in Eq. (49) (more precisely, Eq. (45) is valid). Resetting
$C \leftarrow Cc^{-\frac{2s}{1+s}}$, we obtain the assertion. ■

Lemma 19 There is a constant $C$ such that

$$\log Q(\delta, \mathcal{H}_{\ell_\infty}(R)) \ge CM\log Q(\delta/\sqrt{M}, \mathcal{H}(R)),$$
for sufficiently small $\delta$.

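Proof The proof is analogous to that of Lemma 5 in Raskutti et al. (2010). We describe the
outline of the proof. Let $N = Q(\sqrt{2}\delta/\sqrt{M}, \mathcal{H}(R))$ and let $\{f_m^1, \dots, f_m^N\}$ be a $\sqrt{2}\delta/\sqrt{M}$-packing
of $\mathcal{H}_m(R)$. Then we can construct a function class $\tilde{\mathcal{F}}$ as

$$\tilde{\mathcal{F}} = \left\{f^{j} = \sum_{m=1}^M f_m^{j_m} \,\middle|\, j = (j_1, \dots, j_M) \in \{1, \dots, N\}^M\right\}.$$

We denote $[N] := \{1, \dots, N\}$. For two functions $f^{j}, f^{j'} \in \tilde{\mathcal{F}}$, we have by the construction

$$\|f^{j} - f^{j'}\|_{L_2(\Pi)}^2 = \sum_{m=1}^M\|f_m^{j_m} - f_m^{j'_m}\|_{L_2(\Pi)}^2 \ge \frac{2\delta^2}{M}\sum_{m=1}^M 1[j_m \ne j'_m].$$

Thus, it suffices to construct a sufficiently large subset $A \subseteq [N]^M$ such that all different
pairs $j, j' \in A$ have Hamming distance $d_H(j, j') := \sum_{m=1}^M 1[j_m \ne j'_m]$ of at least $M/2$.
Now we define $d_H(A, j) := \min_{j'\in A} d_H(j', j)$. If $|A|$ satisfies

$$\left|\left\{j \in [N]^M \,\middle|\, d_H(A, j) \le \frac{M}{2}\right\}\right| < \left|[N]^M\right| = N^M, \tag{50}$$

then there exists a member $j' \in [N]^M$ such that $j'$ is more than $\frac{M}{2}$ away from $A$ with
respect to $d_H$, i.e. $d_H(A, j') > \frac{M}{2}$. That is, we can add $j'$ to $A$ as long as Eq. (50) holds.
Now since

$$\left|\left\{j \in [N]^M \,\middle|\, d_H(A, j) \le \frac{M}{2}\right\}\right| \le |A|\binom{M}{M/2}N^{M/2},$$

Eq. (50) holds as long as $A$ satisfies

$$|A| \le \frac{1}{2}\frac{N^M}{\binom{M}{M/2}N^{M/2}} =: Q^*. \tag{51}$$

The logarithm of $Q^*$ can be evaluated as follows:

$$\begin{aligned}
\log Q^* &= \log\left(\frac{1}{2}\frac{N^M}{\binom{M}{M/2}N^{M/2}}\right) = M\log N - \log 2 - \log\binom{M}{M/2} - \frac{M}{2}\log N \\
&\ge \frac{M}{2}\log N - \log 2 - \log 2^M \ge \frac{M}{2}\log\frac{N}{16}.
\end{aligned}$$

There exists a constant $C$ such that $\log N = \log Q(\sqrt{2}\delta/\sqrt{M}, \mathcal{H}(R)) \ge C\log Q(\delta/\sqrt{M}, \mathcal{H}(R))$ because
$\log Q(\delta, \mathcal{H}(R)) \sim \left(\frac{R}{\delta}\right)^{2s}$. Thus we obtain the assertion for sufficiently large $N$. ■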
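
The counting argument above can be illustrated on a small instance. The following is a minimal sketch under our own assumptions (even $M$, tiny $N$ and $M$ so that $[N]^M$ can be enumerated); the function names are hypothetical. It greedily adds index vectors that differ from every previously chosen vector in more than $M/2$ coordinates and compares the resulting size with $Q^*$ from Eq. (51).

```python
import math
from itertools import product

def hamming(j, k):
    """Number of coordinates in which the index vectors j and k differ."""
    return sum(1 for a, b in zip(j, k) if a != b)

def greedy_separated_subset(N, M):
    """Greedily build A in [N]^M with pairwise Hamming distance > M/2."""
    A = []
    for j in product(range(1, N + 1), repeat=M):
        if all(hamming(j, k) > M / 2 for k in A):
            A.append(j)
    return A

def q_star(N, M):
    """Q* = N^M / (2 * C(M, M/2) * N^(M/2)) for even M, as in Eq. (51)."""
    return N ** M / (2 * math.comb(M, M // 2) * N ** (M // 2))
```

Because the greedy pass visits every element of $[N]^M$, the resulting set is maximal, so by the argument around Eq. (50) its size must exceed $Q^*$.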
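Appendix F. Proof of Technical Lemmas
F.1 Proof of Lemma 2

Recall that Eq. (7) gives

$$\|\hat f - f^*\|_{L_2(\Pi)}^2 = O_p\left(\min_{\{r_m\}_{m=1}^M:\ r_m > 0}\left\{\alpha_1^2 + \beta_1^2 + \left[\left(\frac{\alpha_2}{\alpha_1}\right)^2 + \left(\frac{\beta_2}{\beta_1}\right)^2\right]\|f^*\|_\psi^2 + \frac{M\log(M)}{n}\right\}\right). \tag{52}$$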

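We derive an upper bound of the right hand side by adding a constraint $r_m = r$ $(\forall m)$. Since
$s_m = s$ $(\forall m)$, under the constraint $r_m = r$ $(\forall m)$ we have

$$\alpha_1 = 3\frac{\sqrt{M}r^{-s}}{\sqrt{n}}, \quad \alpha_2 = 3\frac{sr^{1-s}}{\sqrt{n}}\|\mathbf{1}\|_{\psi^*}, \quad \beta_1 = 3\frac{\sqrt{M}r^{-\frac{s(3-s)}{1+s}}}{n^{\frac{1}{1+s}}}, \quad \beta_2 = 3\frac{sr^{\frac{(1-s)^2}{1+s}}}{n^{\frac{1}{1+s}}}\|\mathbf{1}\|_{\psi^*}.$$

Thus $\left(\frac{\alpha_2}{\alpha_1}\right)^2 = \left(\frac{\beta_2}{\beta_1}\right)^2 = \frac{s^2r^2}{M}\|\mathbf{1}\|_{\psi^*}^2$, and Eq. (52) becomes

$$\|\hat f - f^*\|_{L_2(\Pi)}^2 = O_p\left(\min_{r > 0}\left\{\alpha_1^2 + \beta_1^2 + \frac{2}{M}s^2r^2\|\mathbf{1}\|_{\psi^*}^2\|f^*\|_\psi^2 + \frac{M\log(M)}{n}\right\}\right). \tag{53}$$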



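By the definition, we see that the first two terms are monotonically decreasing functions
of $r$ and the third term is a monotonically increasing function. The minimum
of the right hand side is attained by balancing $\alpha_1^2 + \beta_1^2$ and $\frac{2}{M}s^2r^2\|\mathbf{1}\|_{\psi^*}^2\|f^*\|_\psi^2$. Since
$\alpha_1^2 + \beta_1^2 \le 2\max(\alpha_1^2, \beta_1^2)$, Eq. (53) indicates that

$$\|\hat f - f^*\|_{L_2(\Pi)}^2 \le O_p\left(\min_{r_m = r}\left\{2\max(\alpha_1^2, \beta_1^2) + \frac{2}{M}s^2r^2\|\mathbf{1}\|_{\psi^*}^2\|f^*\|_\psi^2 + \frac{M\log(M)}{n}\right\}\right). \tag{54}$$

To balance the first term and the second term, we need to consider two situations: $\alpha_1^2 = \frac{1}{M}s^2r^2\|\mathbf{1}\|_{\psi^*}^2\|f^*\|_\psi^2$ or $\beta_1^2 = \frac{1}{M}s^2r^2\|\mathbf{1}\|_{\psi^*}^2\|f^*\|_\psi^2$.

First we balance the terms $\alpha_1^2$ and $\frac{1}{M}s^2r^2\|\mathbf{1}\|_{\psi^*}^2\|f^*\|_\psi^2$ under the restriction that $r_m = r$ $(\forall m)$:

$$\begin{aligned}
&\alpha_1^2 = \frac{1}{M}s^2r^2\|\mathbf{1}\|_{\psi^*}^2\|f^*\|_\psi^2 \\
\Leftrightarrow\ & 9\frac{Mr^{-2s}}{n} = \frac{1}{M}s^2r^2\|\mathbf{1}\|_{\psi^*}^2\|f^*\|_\psi^2 \\
\Leftrightarrow\ & r^{-1} = (s/3)^{\frac{1}{1+s}}M^{-\frac{1}{1+s}}n^{\frac{1}{2(1+s)}}\left(\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi\right)^{\frac{1}{1+s}}.
\end{aligned} \tag{55}$$
For this $r$, we obtain

$$\begin{aligned}
\alpha_1^2 = 9\frac{Mr^{-2s}}{n}
&= 9^{\frac{1}{1+s}}s^{\frac{2s}{1+s}}M^{1-\frac{2s}{1+s}}n^{-\frac{1}{1+s}}\left(\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi\right)^{\frac{2s}{1+s}} \\
&\le 9M^{1-\frac{2s}{1+s}}n^{-\frac{1}{1+s}}\left(\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi\right)^{\frac{2s}{1+s}},
\end{aligned} \tag{56}$$

where we used $s^{\frac{2s}{1+s}} \le 1$ and $9^{\frac{1}{1+s}} \le 9$ in the last inequality.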
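
The balancing step can be checked numerically. The sketch below assumes the homogeneous-setting expressions $\alpha_1^2 = 9Mr^{-2s}/n$ and penalty term $\frac{1}{M}s^2r^2X^2$, where $X$ stands for $\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi$; the function names and the parameter values in the usage are illustrative choices of ours, not taken from the paper.

```python
import math

def alpha1_squared(r, s, M, n):
    """alpha_1^2 = 9 M r^(-2s) / n under r_m = r and s_m = s."""
    return 9.0 * M * r ** (-2.0 * s) / n

def penalty_term(r, s, M, X):
    """(1/M) s^2 r^2 X^2, with X = ||1||_{psi*} ||f*||_psi."""
    return (s * r * X) ** 2 / M

def balancing_radius(s, M, n, X):
    """r solving alpha_1^2 = penalty term, i.e. the r of Eq. (55)."""
    e = 1.0 / (1.0 + s)
    inv_r = (s / 3.0) ** e * M ** (-e) * n ** (e / 2.0) * X ** e
    return 1.0 / inv_r

def closed_form_alpha1_squared(s, M, n, X):
    """Value of alpha_1^2 at the balancing r, i.e. the middle expression of Eq. (56)."""
    e = 1.0 / (1.0 + s)
    return 9.0 ** e * s ** (2 * s * e) * M ** (1 - 2 * s * e) * n ** (-e) * X ** (2 * s * e)
```

For example, with $s = 0.5$, $M = 10$, $n = 1000$ and $X = 10$, evaluating `alpha1_squared` and `penalty_term` at `balancing_radius(0.5, 10, 1000, 10.0)` gives the same value, which also agrees with `closed_form_alpha1_squared`.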

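Next we balance the terms $\beta_1^2$ and $\frac{1}{M}s^2r^2\|\mathbf{1}\|_{\psi^*}^2\|f^*\|_\psi^2$ under the restriction that $r_m = r$ $(\forall m)$:

$$\begin{aligned}
&\beta_1^2 = \frac{1}{M}s^2r^2\|\mathbf{1}\|_{\psi^*}^2\|f^*\|_\psi^2 \\
\Leftrightarrow\ & 9\frac{Mr^{-\frac{2s(3-s)}{1+s}}}{n^{\frac{2}{1+s}}} = \frac{1}{M}s^2r^2\|\mathbf{1}\|_{\psi^*}^2\|f^*\|_\psi^2 \\
\Leftrightarrow\ & r^{-1} = (s/3)^{\frac{1+s}{1+4s-s^2}}M^{-\frac{1+s}{1+4s-s^2}}n^{\frac{1}{1+4s-s^2}}\left(\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi\right)^{\frac{1+s}{1+4s-s^2}}.
\end{aligned}$$

For this $r$, we obtain

$$\begin{aligned}
\beta_1^2 = 9\frac{Mr^{-\frac{2s(3-s)}{1+s}}}{n^{\frac{2}{1+s}}}
&= 9^{\frac{1+s}{1+4s-s^2}}s^{\frac{2s(3-s)}{1+4s-s^2}}M^{\frac{1-2s+s^2}{1+4s-s^2}}n^{-\frac{2}{1+4s-s^2}}\left(\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi\right)^{\frac{2s(3-s)}{1+4s-s^2}} \\
&\le 9M^{\frac{(1-s)^2}{1+4s-s^2}}n^{-\frac{2}{1+4s-s^2}}\left(\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi\right)^{\frac{2s(3-s)}{1+4s-s^2}},
\end{aligned}$$

where we used $s^{\frac{2s(3-s)}{1+4s-s^2}} \le 1$ and $9^{\frac{1+s}{1+4s-s^2}} \le 9$ in the last inequality.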
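Therefore the right hand side of Eq. (54) is further bounded as

$$\begin{aligned}
\|\hat f - f^*\|_{L_2(\Pi)}^2
&\le O_p\Bigg(4\max\left\{9M^{1-\frac{2s}{1+s}}n^{-\frac{1}{1+s}}\left(\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi\right)^{\frac{2s}{1+s}},\ 9M^{\frac{(1-s)^2}{1+4s-s^2}}n^{-\frac{2}{1+4s-s^2}}\left(\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi\right)^{\frac{2s(3-s)}{1+4s-s^2}}\right\} + \frac{M\log(M)}{n}\Bigg) \\
&= O_p\Bigg(M^{1-\frac{2s}{1+s}}n^{-\frac{1}{1+s}}\left(\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi\right)^{\frac{2s}{1+s}} + M^{\frac{(1-s)^2}{1+4s-s^2}}n^{-\frac{2}{1+4s-s^2}}\left(\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi\right)^{\frac{2s(3-s)}{1+4s-s^2}} + \frac{M\log(M)}{n}\Bigg).
\end{aligned}$$

Finally, if $n \ge \left(\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi/M\right)^{\frac{4s}{1-s}}$, the first term of the right hand side of this bound is
not less than the second term:

$$M^{1-\frac{2s}{1+s}}n^{-\frac{1}{1+s}}\left(\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi\right)^{\frac{2s}{1+s}} \ge M^{\frac{(1-s)^2}{1+4s-s^2}}n^{-\frac{2}{1+4s-s^2}}\left(\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi\right)^{\frac{2s(3-s)}{1+4s-s^2}}.$$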
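More precisely, with $r$ given in Eq. (55), the upper bound (56) of $\alpha_1^2$ gives that, for
$n \ge \left(\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi/M\right)^{\frac{4s}{1-s}}$, we have

$$\max\left\{\alpha_1^2, \beta_1^2, \frac{M\log(M)}{n}\right\} \le 9M^{\frac{1-s}{1+s}}n^{-\frac{1}{1+s}}\left(\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi\right)^{\frac{2s}{1+s}} \vee \frac{M\log(M)}{n}.$$

Thus by setting

$$\lambda_1^{(n)} = 18M^{\frac{1-s}{1+s}}n^{-\frac{1}{1+s}}\|\mathbf{1}\|_{\psi^*}^{\frac{2s}{1+s}}\|f^*\|_\psi^{-\frac{2}{1+s}} \ge \left(\frac{\alpha_2}{\alpha_1}\right)^2 + \left(\frac{\beta_2}{\beta_1}\right)^2,$$

Theorem 1 gives that for all $n$ and $t'$ that satisfy $\frac{\log(M)}{\sqrt{n}} \le 1$ and

$$\frac{4\phi\sqrt{n}}{\kappa_M}\left(9M^{\frac{1-s}{1+s}}n^{-\frac{1}{1+s}}\left(\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi\right)^{\frac{2s}{1+s}} \vee \frac{M\log(M)}{n}\right)\eta(t') \le \frac{1}{12}$$

and for all $t \ge 1$, we have

$$\begin{aligned}
\|\hat f - f^*\|_{L_2(\Pi)}^2
&\le \frac{24\eta(t)^2\phi^2}{\kappa_M}\left(18M^{1-\frac{2s}{1+s}}n^{-\frac{1}{1+s}}\left(\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi\right)^{\frac{2s}{1+s}} + \frac{M\log(M)}{n}\right) \\
&\quad + 4 \times 18M^{1-\frac{2s}{1+s}}n^{-\frac{1}{1+s}}\left(\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi\right)^{\frac{2s}{1+s}} \\
&\le C\eta(t)^2\left(M^{1-\frac{2s}{1+s}}n^{-\frac{1}{1+s}}\left(\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi\right)^{\frac{2s}{1+s}} + \frac{M\log(M)}{n}\right)
\end{aligned} \tag{57}$$

with probability $1 - \exp(-t) - \exp(-t')$, where $C$ is a sufficiently large
constant depending on $\phi$ and $\kappa_M$. Finally notice that the condition

$$\frac{4\phi\sqrt{n}}{\kappa_M}\left(9M^{\frac{1-s}{1+s}}n^{-\frac{1}{1+s}}\left(\|\mathbf{1}\|_{\psi^*}\|f^*\|_\psi\right)^{\frac{2s}{1+s}} \vee \frac{M\log(M)}{n}\right)\eta(t') \le \frac{1}{12}$$

automatically gives $\frac{\log(M)}{\sqrt{n}} \le 1$; thus we can drop the condition $\frac{\log(M)}{\sqrt{n}} \le 1$. Then we obtain the assertion.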



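F.2 Proof of Lemma 3

We assume $1 < p < \infty$ and $1 < q < \infty$. The proof for the situations $p = 1, \infty$ or $q = 1, \infty$
is straightforward. First, applying Hölder's inequality twice, we obtain

$$\begin{aligned}
\langle b, a\rangle = \sum_{j=1}^{M'}\sum_{k=1}^{M_j}b_{j,k}a_{j,k}
&\le \sum_{j=1}^{M'}\left(\sum_{k=1}^{M_j}a_{j,k}^p\right)^{\frac{1}{p}}\left(\sum_{k=1}^{M_j}b_{j,k}^{p^*}\right)^{\frac{1}{p^*}} && (\because \text{Hölder's inequality}) \\
&\le \left\{\sum_{j=1}^{M'}\left(\sum_{k=1}^{M_j}a_{j,k}^p\right)^{\frac{q}{p}}\right\}^{\frac{1}{q}}\left\{\sum_{j=1}^{M'}\left(\sum_{k=1}^{M_j}b_{j,k}^{p^*}\right)^{\frac{q^*}{p^*}}\right\}^{\frac{1}{q^*}} && (\because \text{Hölder's inequality}).
\end{aligned}$$

Therefore we obtain that

$$\|b\|_{\psi^*} \le \left\{\sum_{j=1}^{M'}\left(\sum_{k=1}^{M_j}b_{j,k}^{p^*}\right)^{\frac{q^*}{p^*}}\right\}^{\frac{1}{q^*}}. \tag{58}$$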
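On the other hand, if we set

$$a_{j,k} = \frac{b_{j,k}^{p^*-1}\left(\sum_{k'=1}^{M_j}b_{j,k'}^{p^*}\right)^{\frac{q^*}{p^*}-1}}{\left\{\sum_{j'=1}^{M'}\left(\sum_{k'=1}^{M_{j'}}b_{j',k'}^{p^*}\right)^{\frac{q^*}{p^*}}\right\}^{1-\frac{1}{q^*}}},$$

then we have

$$\left\{\sum_{j=1}^{M'}\left(\sum_{k=1}^{M_j}a_{j,k}^p\right)^{\frac{q}{p}}\right\}^{\frac{1}{q}} = 1$$

and

$$\langle b, a\rangle = \frac{\sum_{j=1}^{M'}\left(\sum_{k=1}^{M_j}b_{j,k}^{p^*}\right)^{\frac{q^*}{p^*}}}{\left\{\sum_{j'=1}^{M'}\left(\sum_{k'=1}^{M_{j'}}b_{j',k'}^{p^*}\right)^{\frac{q^*}{p^*}}\right\}^{1-\frac{1}{q^*}}} = \left\{\sum_{j=1}^{M'}\left(\sum_{k=1}^{M_j}b_{j,k}^{p^*}\right)^{\frac{q^*}{p^*}}\right\}^{\frac{1}{q^*}}.$$

Therefore we obtain

$$\|b\|_{\psi^*} \ge \left\{\sum_{j=1}^{M'}\left(\sum_{k=1}^{M_j}b_{j,k}^{p^*}\right)^{\frac{q^*}{p^*}}\right\}^{\frac{1}{q^*}}. \tag{59}$$

Combining Eqs. (58) and (59), we have $\|b\|_{\psi^*} = \left\{\sum_{j=1}^{M'}\left(\sum_{k=1}^{M_j}b_{j,k}^{p^*}\right)^{\frac{q^*}{p^*}}\right\}^{\frac{1}{q^*}}$. Thus we obtain
the assertion.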
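
The duality between the mixed $(p, q)$-norm and the mixed $(p^*, q^*)$-norm can be checked numerically on a small example. The following is a sketch under our own assumptions: the vectors are stored as lists of groups, all entries of $b$ are strictly positive, and the function names are hypothetical.

```python
import math

def conjugate(p):
    """Conjugate exponent p* with 1/p + 1/p* = 1 (requires p > 1)."""
    return p / (p - 1.0)

def mixed_norm(groups, p, q):
    """Mixed (p, q)-norm: l_p norm inside each group, l_q norm across groups."""
    inner = [sum(abs(x) ** p for x in g) ** (1.0 / p) for g in groups]
    return sum(v ** q for v in inner) ** (1.0 / q)

def inner_product(a, b):
    """Inner product of two grouped vectors with the same group structure."""
    return sum(x * y for ga, gb in zip(a, b) for x, y in zip(ga, gb))

def attaining_vector(b, p, q):
    """Vector a with mixed (p, q)-norm 1 for which <b, a> equals the (p*, q*)-norm of b.

    Assumes every entry of b is strictly positive.
    """
    ps, qs = conjugate(p), conjugate(q)
    group_sums = [sum(x ** ps for x in g) for g in b]      # sum_k b_{j,k}^{p*}
    total = sum(v ** (qs / ps) for v in group_sums)        # sum_j (...)^{q*/p*}
    scale = total ** (1.0 - 1.0 / qs)
    return [[x ** (ps - 1.0) * v ** (qs / ps - 1.0) / scale for x in g]
            for g, v in zip(b, group_sums)]
```

The vector returned by `attaining_vector` shows that the upper bound obtained from Hölder's inequality is attained, which is the content of the second half of the proof.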
F.3 Proof of Lemma 5

Recall that



«i = 3 



r^ + M-iy o^l~ S a o 

,«2 = 3 — W,pi = 3 



n 



2s(3-s) j (l-a) 2 

> P2 = 3- 



Thus we have 



OC2 

a i 



n 



.2 2(l-s) 



2 



1 + s 



i 



n \ 2 2 ^ ! 

= — ^ ~ mm < s rf , 7 )• , 

rr 23 +A/-l | M — 1 f 



,2 2(l-«) 



and 



Pi 



2(l-s)" 
"3" 



7( 



2s(3-s) 

r t 1+a +M-1 



~ min < 



2(l-s) 2 
2 2 5 ' 1 

S ri ' M-l 



> . 



ft J 



Suppose r 1 2s > M — 1 and r x 



2s(3-s) 



'l 



l + s 



n !+ s 



2 

~ s r 



with the constraint for r x becomes 



2s(3-s) 

1+3 > M - 1, then we have a? ~ rf 2s n _1 , /3 2 = 

2 and — s 2 ^ 2 - Thus the minimization problem in Eq. ([7]) 

ecnmes 



mm 

n>0: 

2s(3-s) 

r~ 2s >M-l, rj 1+8 >M-1 



mm 

ri>0: 

2s(3-s) 

rf 2s >M-l, r a 1+s >A/-1 



a 2 + /3 2 + 



"2_\ 



wrwi 



r x 2s n 1 + r x 



2s(3-s) „ 

- i-H n-^+rf||r||J 



(60) 



2s(3-s) 



If we neglect the constraints r x 2s > M — 1 and r x 1+s > M — 1, the minimum is attained 



2s(3-s) 2 

at n (up to a constant factor) that satisfies max{r x 2s n^ 1 , r x 1+8 n _T +=} = i-e. 



ri = max < n 2 ( 1 +<0 



*ll , 1+s ,n"i+4s-s^ II f || , !+ 4 



1 + s 



nr 



1 and n > 



4s 

Therefore if n > ||/*||^," s (this is satisfied because ||/*||^ = M, \\f*\\e x = 

4s 1 L_ 

Mi-s is imposed), then the minimum is attained at r x = n 2 ( 1 + s ) ||/*|L 1+3 . Finally the 

l+s 2s(3-s) 

condition n > (Mlog(M))— yields that r x 2s > M-l and r x 1+s > M — 1 for r x = 
1 L_ 

n 2(i+s) || !+ a Therefore the constraints for r x in Eq. ()60f) can be removed. Summarizing 
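the above discussions, we obtain

$$\min_{\{r_m\}_{m=1}^M:\ r_m > 0}\left\{\alpha_1^2 + \beta_1^2 + \left[\left(\frac{\alpha_2}{\alpha_1}\right)^2 + \left(\frac{\beta_2}{\beta_1}\right)^2\right]\|f^*\|_\psi^2\right\} \simeq n^{-\frac{1}{1+s}}\|f^*\|_\psi^{\frac{2s}{1+s}}.$$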
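Thus we obtain the following convergence rates:

$$\|\hat f^{(1)} - f^*\|_{L_2(\Pi)}^2 = O_p\left(n^{-\frac{1}{1+s}}M^{\frac{2s}{1+s}} + \frac{M\log(M)}{n}\right),$$

$$\|\hat f^{(\infty)} - f^*\|_{L_2(\Pi)}^2 = O_p\left(n^{-\frac{1}{1+s}} + \frac{M\log(M)}{n}\right).$$

Now since $n \ge (M\log(M))^{\frac{1+s}{s}}$, the above convergence rates can be simplified as

$$\|\hat f^{(1)} - f^*\|_{L_2(\Pi)}^2 = O_p\left(n^{-\frac{1}{1+s}}M^{\frac{2s}{1+s}}\right), \qquad \|\hat f^{(\infty)} - f^*\|_{L_2(\Pi)}^2 = O_p\left(n^{-\frac{1}{1+s}}\right).$$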

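F.4 Proof of Lemma 6 (Derivation of Local Rademacher Complexity)

For $f \in \mathcal{H}^{\oplus M}$, we define

$$U_{n,\psi}(f) := \alpha_1\frac{\|f\|_{L_2(\Pi)}}{\sqrt{\kappa_M}} + \alpha_2\|f\|_\psi + \beta_1\frac{\|f\|_{L_2(\Pi)}}{\sqrt{\kappa_M}} + \beta_2\|f\|_\psi + \sqrt{\frac{M\log(M)}{n}}\frac{\|f\|_{L_2(\Pi)}}{\sqrt{\kappa_M}}.$$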



Then by Eq. (|34p we obtain 



M 



m=l 

We know that there exists a constant (f) such that 

P I max sup n ^ > <jyq{t) I < e , 

(see Lemma [T71) . Let f/(i) := max-fyi, t/n}, and the event 5t be 

I n Si=l Pi/mix 



St := < ^(i) < max sup 



fm&Hm Un,s m (fm) 



< <Pv(t + 1) 



Then, by Eq. (|6ip . we have P(St) < e * for f > 1. Using this relation, we obtain 
following upper bound of the local Rademacher complexity: 



RuCh { ;\r)) 



-E 



sup 



feH^{R) 11 i=i 



1 - 

n — ^ 



i=0 



sup 



fen^(R) n i=i 



1 n 

n ^ — ^ 



Xj) I S t 



P(S t 



<E 



sup 



1 n 

-£*</( 



2?i) I 5n 



/6«^(il) U i=l 



+ £ E 



i n 

SUP - ^(Ti/(Xi) | 5 t 



S&i%\K) n i=l 



P(S t 
<E



<E 



M 



SUp ^<Mi(/m)|S 



feH ( j\R) m=l 



M 



sup ^^(t + l)[/(^(/ m )|5 t 



f£H ( j\R) m=l 



sup 4>U n ^(f) | So 



t=i 



SUp 077(t + l)C/n,*(/) I 



<^ ( ai— = + a 2 R + Pi—!== + /3 2 R ■ 



Since 



rj{t + l) e "* < jT (v^+T + ^) e-Wdt < 5, 



we obtain 



MKHR)) < 64> + a 2 R + + /3 2 i? + t ^-^= 



By resetting $\phi \leftarrow 6\phi$, we obtain the local Rademacher complexity upper bound.